EPUB is an open, industry-standard format for e-books. However, support for EPUB and its many features varies across reading devices and applications. Use your device or app settings to customize the presentation to your liking. Settings that you can customize often include font, font size, single or double column, landscape or portrait mode, and figures that you can click or tap to enlarge. For additional information about the settings and features on your reading device or app, visit the device manufacturer’s Web site.
Many titles include programming code or configuration examples. To optimize the presentation of these elements, view the e-book in single-column, landscape mode and adjust the font size to the smallest setting. In addition to presenting code and configurations in the reflowable text format, we have included images of the code that mimic the presentation found in the print book; therefore, where the reflowable format may compromise the presentation of the code listing, you will see a “Click here to view code image” link. Click the link to view the print-fidelity code image. To return to the previous page viewed, click the Back button on your device or app.
Next-generation Routing with SDN, Service Virtualization, and Service Chaining
Copyright © 2016 Pearson Education, Inc.
Published by: Addison Wesley
All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission to use material from this work, please submit a written request to Pearson Education, Inc., Permissions Department, 200 Old Tappan Road, Old Tappan, New Jersey 07675, or you may fax your request to (201) 236-3290.
Text printed in the United States on recycled paper at RR Donnelley, Crawfordsville, IN
First Printing November 2015
Library of Congress Cataloging-in-Publication Number: 2015950654
ISBN-13: 978-0-13-398935-9
ISBN-10: 0-13-398935-6
Publisher
Paul Boger
Associate Publisher
David Dusthimer
Executive Editor
Brett Bartow
Senior Development Editor
Christopher Cleveland
Managing Editor
Sandra Schroeder
Project Editor
Mandie Frank
Copy Editor
Cenveo® Publisher Services
Technical Editors
Ignas Bagdonas
Jon Mitchell
Editorial Assistant
Vanessa Evans
Cover Designer
Alan Clements
Composition
Cenveo® Publisher Services
Indexer
Cenveo® Publisher Services
Trademarks
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.
Warning and Disclaimer
The author and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.
Special Sales
For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.
For government sales inquiries, please contact governmentsales@pearsoned.com.
For questions about sales outside the U.S., please contact international@pearsoned.com.
Visit us on the Web: informit.com/aw
Russ White began his network engineering career installing terminal emulation cards and inverse multiplexers in the United States Air Force. In 1996, he moved to Raleigh, N.C., to join Cisco Systems in the Technical Assistance Center (TAC) routing protocols team. From TAC, Russ moved to the global escalation team, then into engineering, and finally into sales as a Distinguished Architect. He is currently a network architect working in the area of network complexity and large-scale design, a member of the IETF Routing Area Directorate, an active speaker and writer, and active in the Internet Society. He holds CCIE #2637, CCDE 2007:001, the CCAr, a Master's in Information Technology from Capella University, and a Master's in Christian Ministry from Shepherds Theological Seminary. He lives in Oak Island, N.C., with his wife and two children, and is currently a Ph.D. student at Southeastern Baptist Theological Seminary.
Jeff Tantsura started his network engineering career in the early 1990s at a small ISP as a system/network administrator, later working for bigger ISPs where he was responsible for network design and architecture, and vendor selection.
Currently, Jeff is heading Technology Strategy Routing at Ericsson, as well as chairing the IETF Routing Working Group.
Jeff holds an MSc in Computer Science and Systems Analysis from the University of Georgia and an Executive Certificate of Business Excellence from the Haas School of Business, Berkeley.
He also holds CCIE R&S #11416 and Ericsson Certified Expert IP Networking #8.
Jeff lives in Palo Alto, CA, with his wife and youngest child.
Ignas Bagdonas
Ignas Bagdonas has been involved in the network engineering field for the last two decades, covering operations, deployment, design, architecture, development, and standardization aspects. He has worked on multiple large SP and enterprise networks worldwide, participated in many of the world's first technology deployments, and has been involved with building community awareness via conferences, seminars, and workshops. His current focus covers end-to-end network architecture evolution and emerging technologies. Ignas holds Cisco CCDE and CCIE certifications.
Jon Mitchell
Jon Mitchell, CCIE No. 15953, is a network engineer in Google's Technical Infrastructure organization, where he works on their global backbone. In the 15 years he has been in the networking industry prior to Google, Jon worked in network architecture at Microsoft, systems engineering at Cisco Systems, network architecture and engineering at AOL, and network engineering at Loudcloud. He is also an active IETF participant in various routing area working groups. Through all of these roles, Jon has always had a passion for working on large-scale problems and solving them through simplification and automation. When Jon is not thinking about networking, he enjoys many other passions such as hiking, running, supporting clean water, microbreweries, and spending time with his wife and four children.
Russ White: This book is dedicated to Bekah
and Hannah. Thank you for sticking with your
grumpy old dad through thick and thin.
Jeff Tantsura: This book is dedicated to my
family: Marina, Ilia, Miriam, and Davy—thank
you for your support!
Russ White:
I would like to thank the many people who have taught me networking through the years, including Denise Fishburne, Don Slice, Alvaro Retana, Robert Raszuk, and a host of others—too many to name. I would also like to thank Dr. Doug Bookman for driving me to be a better thinker, Dr. Will Coberly for driving me to be a better writer, and Dr. Larry Pettegrew for driving me just to be a better person. Finally, I’d like to thank Greg Ferro and Ethan Banks for inspiring me to start writing again.
Jeff Tantsura:
I would like to thank the people who taught and helped me through the years to understand networking better: Acee Lindem, Tony Przygienda, Tony Li, Jakob Heitz, and many others, too many to name.
Special thanks to my co-author Russ for inspiring me!
Chapter 1: Defining Complexity
Chapter 2: Components of Complexity
Chapter 3: Measuring Network Complexity
Chapter 4: Operational Complexity
Chapter 5: Design Complexity
Chapter 6: Managing Design Complexity
Chapter 7: Protocol Complexity
Chapter 8: How Complex Systems Fail
Chapter 9: Programmable Networks
Chapter 10: Programmable Network Complexity
Chapter 11: Service Virtualization and Service Chaining
Chapter 12: Virtualization and Complexity
Chapter 13: Complexity and the Cloud
Chapter 1: Defining Complexity
Anything for Which There Is More State Than Required to Achieve a Goal
Future Extensions versus New Protocols
Why Not Build Infinitely Complex Systems?
Quick, Cheap, and High Quality: Choose Two
Consistency, Availability, and Partition Tolerance: Choose Two
Journey into the Center of Complexity
Chapter 2: Components of Complexity
Distance Vector: An EIGRP Example
Link State: OSPF and IS-IS Convergence
An Example of State Failure in the Real World
The Network That Never Converges
Chapter 3: Measuring Network Complexity
Some Measures of Network Complexity
Chapter 4: Operational Complexity
The Cost of Human Interaction with the System
Policy Dispersion versus Optimal Traffic Handling
Solving the Management Complexity Problem
Automation as a Solution to Management Complexity
Modularity as a Solution to Management Complexity
Protocol Complexity versus Management Complexity
Chapter 5: Design Complexity
Control Plane State versus Stretch
State versus Stretch: Some Final Thoughts
Topology versus Speed of Convergence
Topology versus Speed of Convergence: Some Final Thoughts
Fast Convergence versus Complexity
Improving Convergence with Intelligent Timers: Talk Faster
Removing Timers from Convergence: Precompute
Working around Topology: Tunneling to the Loop-Free Alternate
Some Final Thoughts on Fast Convergence
Virtualization versus Design Complexity
Chapter 6: Managing Design Complexity
How Modularity Attacks the Complexity Problem
Failure Domains and Information Hiding
Final Thoughts on Information Hiding
Chapter 7: Protocol Complexity
Flexibility versus Complexity: OSPF versus IS-IS
Layering versus Protocol Complexity
Protocol Complexity versus Design Complexity
EIGRP and the Design Conundrum
Final Thoughts on Protocol Complexity
Chapter 8: How Complex Systems Fail
Positive Feedback Loops in Network Engineering
Speed, State, and Surface: Stability in the Network Control Plane
TCP Synchronization as a Shared Fate Problem
Thoughts on Root Cause Analysis
Engineering Skills and Failure Management
Chapter 9: Programmable Networks
The Ebb and Flow of Centralization
Defining Network Programmability
Use Cases for Programmable Networks
Programmable Network Interfaces
The Programmable Network Landscape
Path Computation Element Protocol
Interface to the Routing System
Chapter 10: Programmable Network Complexity
Surface and the Programmable Network
Application to Control Plane Failure Domain
Controller to Controller Failure Domain
Final Thoughts on Failure Domains
Chapter 11: Service Virtualization and Service Chaining
Network Function Virtualization
Chapter 12: Virtualization and Complexity
Policy Dispersion and Network Virtualization
Surface and Policy Interaction
Chapter 13: Complexity and the Cloud
Where Does the Complexity Live?
Large Numbers of Interacting Parts
What Makes Something “Too Complex”?
Managing Complexity in the Real World
Every engineer, no matter what type of engineering they do, faces complexity almost constantly. The faster, cheaper, higher quality triad is the constant companion of everything the engineer does. Sometimes, though, the complexity isn't so obvious. For instance, how long will the software project take? Two hours, two weeks, or too long are the common replies—another encounter with complexity.
While research into complexity theory has proceeded apace in the scientific and mathematical worlds, the application of the theories and ideas to more practical engineering problems, particularly in a way that the “average engineer” can read and understand, simply hasn’t kept pace. This book aims to fill that gap for network engineering.
In a move that will probably disappoint readers with a math degree, this book is devoid of the elegant mathematical models complexity theorists are working with. Rather than focusing on the "pure theory" and the math, it concentrates on the practical application of the ideas complexity theorists are investigating.
This book, then, is targeted at the "average" network engineer who wants to gain an understanding of why particular common constructs work the way they do, such as hierarchical design, aggregation, and protocol layering. By getting behind these widely accepted ways of designing a network and exposing the reasons for the tradeoffs designed into the system in each case, the authors hope the reader will learn to take lessons from one area and apply them to others. After reading this book, network engineers should begin to understand why hierarchical design and layering, for instance, work, and how to see and change the tradeoffs in more specific situations.
This book begins, in the first chapter, with a high-level view of complexity theory. This is not deep theory; rather, it is designed to provide a hands-on view of complexity without diving into heavy math (or any math at all). The second chapter provides an overview of various attempts at measuring complexity in a network, including some of the problems plaguing each attempt. The third chapter proposes a model of complexity that will be used throughout the rest of the book.
Chapters 4 through 7 consider examples of complexity in network and protocol design. The point these chapters attempt to drive home is how the models and concepts from the first three chapters play out in real networks. Chapter 8 is something of an important detour into the world of complex system failure in network engineering terms. The rest of the book is dedicated first to providing an overview of three new technologies network engineers face at the time of writing, and then to an analysis of each one in terms of complexity tradeoffs.
Note: Documents referenced in the form draft-xxx are IETF works in progress, and therefore do not have stable document names or uniform resource locators (URLs). Because of these limitations, explicit references have not been given; instead, only the title of the document and the document name are provided. To find these documents, please perform a document search at the IETF website (ietf.org).
Computer networks are complex.
But what does “computer networks are complex” mean? Can you put a network on a scale and have the needle point to “complex”? Is there a mathematical model into which you can plug the configurations and topology of a set of network devices that will, in turn, produce a “complexity index”? How do the concepts of scale, resilience, brittleness, and elegance relate to complexity? The answers to these questions are—unfortunately—complex. In fact, the most difficult issue involved in answering these questions is deciding where to begin.
The best place to begin is at the beginning—in this case a few definitions of complexity, from “everything I don’t understand,” to “that which involves a lot of unintentional consequences.” There is at least some truth in each of these answers; some part of each of these answers is helpful in building a picture of complexity in general, and complexity in network design and architecture.
Once the meaning of complexity has been examined, this book will turn to asking why computer networks must be complex in the first place. Wouldn’t things be a lot simpler if engineers just avoided all complexity from protocol design to network management? Can’t complexity be “managed out” in some way? The second section of this chapter provides an overview of research in the field of complexity. As it turns out, complexity is a necessary tradeoff in the real world. Getting there will require winding through some rather—pardon the pun—complex material. It’s going to be necessary to look at the components of complexity and to dissect a broad idea into a set of components that can actually be understood in a useful way.
Chapter 2, “Components of Complexity,” will begin investigating the intersection between complexity and network engineering, considering various reactions to complexity. Network engineers tend to have one of five reactions to complexity, three of which can be shaped into positive responses, while two are generally destructive. The three positive responses are as follows:
1. Abstract the complexity away, to build a black box around each part of the system, so each piece and the interactions between these pieces are more immediately understandable.
2. Toss the complexity over the cubicle wall—to move the problem out of the networking realm into the realm of applications, or coding, or a protocol. As RFC1925 says, “It is easier to move a problem around (e.g., by moving the problem to a different part of the overall network architecture) than it is to solve it.”1
1. Ross Callon, ed., “The Twelve Networking Truths” (IETF, April 1996), https://www.rfc-editor.org/rfc/rfc1925.txt.
3. Add another layer on top, to treat all the complexity as a black box by putting another protocol or tunnel on top of what’s already there. Returning to RFC1925, “It is always possible to add another level of indirection.”2
2. Ibid.
The two generally negative responses are as follows:
1. Become overwhelmed with the complexity, label what exists as “legacy,” and chase some new shiny thing that will solve all the problems in what is perceived as a much less complex way.
2. Ignore the problem and hope it will go away. A good example is arguing for an exception “just this once,” so a particular business goal can be met, or some problem fixed, within a very tight schedule, with the promise that the complexity issue will be dealt with “later.”
These reactions show up as solutions to complexity in many different realms of network design, including operational complexity, design complexity, and protocol complexity; each of these areas will be examined in more detail in individual chapters to gain a better understanding of the costs and benefits of adding complexity in each one.
Once all this background material is covered, it will be time to turn to some more practical examples. The basic operation and/or concepts behind different technologies that promise to help tame the complexity monster in our networks will be considered, including an examination of where each one reduces and increases complexity.
The final chapter will bring it all together. The final goal is to build a generalized view of complexity in network systems that can be applied to making real decisions about operations, design, and protocols. While it will take some theory to get from here to there, the final goal is eminently practical: to learn how to recognize and manage complexity in the real world of computer networks.
When confronted with the question, “what is complexity,” you might reply, “Complexity—like beauty—is in the eye of the beholder.” It’s harder to live with this definition, however, because it internalizes complexity. Making complexity into an internal state of mind, or an impression, leaves complexity with no corresponding reality, and hence leaves engineers with no tools or models with which to understand, control, or manage complexity. On the other hand, there is real value in confronting the perception of complexity, as the perception can help you see, and understand, the reality of complexity in networks.
A useful place to begin is defining complexity with two broad perceptions: complexity is anything I don’t understand, and complexity is anything with a lot of (moving) parts. Moving beyond these, it’s important to consider complexity as state versus intent, and then, finally, complexity and the law of unintended consequences.
What is complex for one person might be simple for another. Or, as Clarke’s third law says, “Any sufficiently advanced technology is indistinguishable from magic.”3 From these observations, you could conclude that complexity is anything you don’t understand. A few definitions from a coder’s point of view illustrate this state of mind:
3. Arthur C. Clarke, Profiles of the Future: An Inquiry into the Limits of the Possible (London: Phoenix, 2000), n.p.
• A clean solution is a solution that works, and that I understand.
• A complex solution is a solution that works, and that I don’t understand.
• Obscure code is code that I don’t understand, and isn’t commented.
• Self-documenting code is code that I wrote, and therefore I can understand without comments.
• A hack is a piece of code that doesn’t work, I didn’t write, and I don’t understand.
• A temporary workaround is a piece of code that I did write and I do understand.
But is “anything I don’t understand” a workable definition in the real world? Several points should illustrate the fallacy of settling on this as a final definition of complexity:
• There are a number of complex systems in the real world that no one understands in their entirety. In fact, there are a number of networks in the real world that no one understands in their entirety. It might, for instance, be possible for a network operator to understand one specific piece of the Internet, or a general outline of how the Internet really works, but for someone to claim that they know how every part and piece of the Internet works would be absurd.
• Increased understanding necessarily means reduced complexity if this definition is true. For any given phenomenon, once it is “understood” in purely materialistic terms, it can be declared “not complex.” For instance, no one in the early 1900s understood the complex operation of a biological system such as the eye. Since that time, the operation of the eye has been researched and documented (at least to some degree). Does this mean the eye is now less complex than it once was?
• In the same way, understanding the intent of a designer might make a design appear less complex without actually changing the design itself. For instance, using Type-Length-Value (TLV) constructions in a protocol might seem to increase the overall complexity for very little gain—but when future proofing a protocol against new functionality is taken into account, using more complex structures might make perfect sense.
In short, the connection between complexity and understanding is tenuous at best—gaining a better understanding of any given system doesn’t really make it less complex, it just makes it more understandable. Don’t underestimate the underlying premise of this definition, though. Why does gaining a better understanding of a system make it appear less complex? Several points can be mentioned here, each of which will be developed more fully through the rest of this book:
• Understanding how a system works allows you to comprehend and account for the interactions between the various parts. One of the steps you can use to manage complexity, then, is to discover the interactions between the parts of the system.
• Understanding how a system works allows you to predict the output based on a set of given inputs, or rather to construct a mental model of the system’s output based on inputs. Another step you can use to manage complexity is to abstract any given system into a set of inputs and outputs that can be used as a proxy for that system within a larger context (or system).
• A mental model of the system also allows you to recognize when the system isn’t operating correctly, and to work around these unplanned outputs, discover why the system is acting the way it is and repair the fault, or determine why the inputs don’t match your expectations (and correct them).
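These three ideas can be sketched in a few lines of code. The Python fragment below is purely my own illustration (the names `subsystem`, `mental_model`, and `check` are invented, not anything defined in this book): it abstracts a subsystem into an input-to-output black box, pairs it with a simplified mental model, and flags outputs that fail to match expectations.

```python
# Illustrative only: these names are invented for this sketch,
# not APIs or terminology from the book.

def subsystem(x: int) -> int:
    """Stand-in for a larger system treated as a black box: all we
    retain about it is the input -> output mapping."""
    return 2 * x + 1

def mental_model(x: int) -> int:
    """The operator's simplified model of the same subsystem, used to
    predict its output from a given input."""
    return 2 * x + 1  # matches today; a drifting system would not

def check(x: int) -> str:
    """Compare reality against the model, so unplanned outputs are
    recognized instead of silently propagating."""
    return "ok" if subsystem(x) == mental_model(x) else "investigate"

for value in (0, 3, 7):
    print(value, subsystem(value), check(value))
```

The point is not the arithmetic but the shape: once a system is reduced to inputs and outputs, a model of it can stand in for the system inside a larger context, and disagreements between model and reality become visible.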
While defining complexity as “anything I don’t understand” isn’t a full definition, it does help scope the definition in some very important ways. Exploring things perceived as being “hard to understand” can lead to a fuller definition—a definition that will be more useful in the real world.
In designing large-scale solutions, one recurring refrain is to “use the minimum possible number of moving parts.” In general, anything with a lot of parts or lots of different parts is seen as complex, from large-scale networks to complex crystal structures. But is the number of parts, or the number of moving parts, or even the number of different parts, really a proxy for complexity? Not necessarily. Two specific examples provide counters to the idea that the number of moving parts is a good determinant of complexity.
The first of these is illustrated in Figure 1.1—a three-dimensional Mandelbulb fractal taken at a high resolution.4
4. “File:Mandelbulb140a.JPG—Wikimedia Commons,” n.p., accessed July 8, 2014, https://commons.wikimedia.org/wiki/File:Mandelbulb140a.JPG.
While there is a lot of apparent complexity in the illustration, it is all actually created by iteratively running the same algorithm—so, in reality, there is only one “moving piece.” A second example, from the networking world, is the interaction of two routing protocols, such as Open Shortest Path First (OSPF) and the Border Gateway Protocol (BGP). While both of these protocols are fairly simple in their design and construction, deploying either one in a large-scale network actually involves a good deal of complexity. Layering BGP on top of OSPF adds a much larger amount of complexity than someone unfamiliar with network design might expect. There are a number of unexpected interactions between these two protocols, from calculating the best path to determining the next hop to use toward a specific destination. Their interaction while converging (in reaction to a change in the network topology) can be especially entertaining.
That complex patterns can result from simple rule sets should remind you not to confuse the appearance of complexity with complexity itself. Simply because something looks complex does not mean it actually is complex; instead you must reach beyond appearances to get a true understanding of complexity. Something with a lot of moving parts isn’t necessarily complex, any more than something with just a few moving parts is necessarily simple. What really matters is the interaction between the moving parts.
• The more often each piece of a system interacts with others, and the more pieces any given piece of a system interacts with, the more complex the system itself will be. The number of interactions between the various parts of a system can be called the interaction plane; the larger the plane of interaction, the more the parts of the system interact with one another. Likewise, the smaller the plane of interaction, the less the parts of the system interact with one another.
• The deeper the relationship between the various parts of a system, the more complex the system will be.
Figure 1.2 illustrates these concepts in a more visual way.
To give a more concrete example, consider an application designed to run across a network. Four possibilities are provided in order of increasing complexity, parallel to the illustration in Figure 1.2:
• One instance of the application transmits blocks of information using the Transmission Control Protocol (TCP), while a second instance receives this information. The application relies on TCP and the underlying network to deliver these blocks of information without error, but doesn’t possess any information about the TCP process’ current state or implementation. This would be a shallow, narrow interaction surface; the application interacts with TCP in multiple places, but knows little about TCP’s internal state.
• One instance of the application transmits blocks of information using TCP, while a second instance receives this information. Because the information being transmitted can sometimes be time sensitive, the application examines the state of the TCP queue from time to time, adjusting the rate at which it is sending blocks, and can request the TCP process to push a block, rather than waiting for the TCP buffer to fill up. This is a deeper interaction surface because the application must now understand something about the way TCP works, and interact with how TCP works intelligently. The interaction surface is still narrow, however, as it is only between TCP and the application at multiple points.
• One instance of the application transmits blocks of information using TCP, while a second instance receives this information. As in the first example, the application doesn’t interact with TCP’s state in any way; it simply places blocks of data into TCP’s buffer and assumes TCP will deliver it correctly across the network. The application is, however, time sensitive, and therefore also reads the clock, which is dependent on Network Time Protocol (NTP) for accuracy, on a regular basis. This interaction surface can be considered a bit broader, as the application is now interacting with two different network protocols, but both interactions are rather shallow.
• One instance of the application transmits blocks of information using TCP, while a second instance receives this information. Because the information being transmitted is time sensitive, the application examines the state of the TCP queue from time to time, adjusting the rate at which it is sending blocks, and can request the TCP process to push a block, rather than waiting for the TCP buffer to fill up. To time these push requests, the application monitors the local time, which it assumes is kept synchronized among multiple devices using NTP (or some similar solution). The interaction surface in this example is both broader and deeper: deeper, in that the application again needs to know the inner workings of TCP’s data transmission, and broader, in that the application is interacting with two protocols (or systems, in the larger sense), rather than one. In fact, in this example, the state of NTP is now tied to the internal state of TCP through the application—and neither protocol knows about this connection.
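The difference between the shallow first scenario and the deeper second one can be sketched concretely. In the snippet below (a sketch, not a prescription), the shallow sender simply hands a block to TCP and trusts it; the time-sensitive sender reaches into TCP’s buffering behavior by setting the standard TCP_NODELAY socket option, which asks the stack to push segments immediately rather than letting them sit in the send buffer. This is one real-world way an application’s interaction surface with TCP gets deeper.

```python
import socket

def send_shallow(sock: socket.socket, block: bytes) -> None:
    # Scenario 1: hand the block to TCP and trust it completely;
    # the application knows nothing about TCP's internal state.
    sock.sendall(block)

def send_time_sensitive(sock: socket.socket, block: bytes) -> None:
    # Scenario 2: the application reaches into TCP's behavior,
    # disabling Nagle's algorithm so small segments are pushed at
    # once instead of waiting for the send buffer to fill.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.sendall(block)
```

Both functions deliver the same bytes; what changes is how much the application must understand about how TCP works underneath it.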
The interaction surface, then, is simply the number of places where two systems interact and the depth of those interactions; each additional Application Programming Interface (API), socket, or other point of contact increases the interaction surface. It’s clear that as the amount and level of interaction between two subsystems within a system increases, the complexity level is driven up.
These concepts will be related to network design and architecture more fully as this book works through various examples, but consider this: OSPF and Intermediate System-to-Intermediate System (IS-IS), for instance, run as “ships in the night” routing protocols. They both rely on the same information about IP addresses, link states, metrics, and other information drawn directly from the network topology. OSPF and IS-IS, even if they’re running on the same network, don’t interact unless they are configured to interact (through redistribution, for instance). Do OSPF and BGP interact in the same way? No, because BGP relies on the underlying IGP to provide IP reachability for building peering relationships and next hop information. Using the model described here, the interaction surface is larger between OSPF and BGP than it is between OSPF and IS-IS, and the APIs are fairly opaque.
Virtually anyone in the engineering world has seen a cartoon of a Rube Goldberg Machine, such as the one in Figure 1.3.5
5. Public domain; taken from: “Professor_Lucifer_Butts.gif (428 × 302),” n.p., accessed July 7, 2014, https://upload.wikimedia.org/wikipedia/commons/a/a6/Professor_Lucifer_Butts.gif.
Quite often these contraptions were labeled as a simple way to solve some problem, from using a napkin during a meal (as in Figure 1.3) to jacking a car up to replace the tire (an elephant atop the car is induced to move onto a platform that levers one end of the car up into the air). Simple is the key word, because these machines were obviously anything but simple. But why are such machines considered complex—even to the point of being humorously complex? Because they illustrate two specific points about the nature of complexity.
Rube Goldberg’s contraptions are always multistep solutions for a simple problem. The automatic napkin machine shown in Figure 1.3, for instance, replaces the simple action of picking the napkin up and wiping your mouth with it. It’s absurd to use such a machine when you obviously have one hand free that could be used for the same task—and even if your other hand isn’t free, it’s simpler to put a fork down on the table and pick up a napkin than to build this crazy machine to do the same thing. Your perception of complexity, then, is related to the relationship between the problem being solved and the solution offered. Any system that adds unnecessary steps, interactions, or parts to solve a specific problem is seen as complex, no matter how simple the problem or the solution when viewed objectively.
Rube Goldberg’s contraptions always focus on the problem at hand to the exclusion of the side effects of the solution proposed. The automatic napkin requires a rocket that must be replaced each time it is used, a bird that must be fed, a clock that must be wound and maintained, and a biscuit that must be replaced after each use. Is the solution really simpler than picking a napkin up? The elephant used to lever the car up so the tire can be replaced must be carried and fed. What should the owner of the car do when the elephant is asleep and the tire needs to be replaced?
The comic genius of Rube Goldberg and his fabulous machines teaches us:
• The complexity of the solution must be directly related to the problem being solved. If there is a simpler solution available, engineers will (and should) gravitate to that solution. Call this the “Occam’s Razor” of complexity in engineering: if two solutions have been proposed to resolve a problem, the simpler solution should always be preferred, given both solutions solve the problem equally well.
• It is easy to create unnecessary complexity by narrowing the focus too far. A simpler solution might actually seem simpler until it meets the test of the future, or engages with reality. It might seem easier to design a protocol with just the encodings needed, and without messy constructions like TLVs—at least until you need to modify the protocol to address some problem you didn’t think of when doing the initial design work. To put this another way, there is often an unexpected brittleness to the (apparently) cleanest and simplest design available. In the same way, not all the problems with a particular solution are obvious until the solution is actually deployed.
Examples of these two principles will crop up on a regular basis as you consider specific instances of complexity in the real worlds of network architecture and network protocol design.
In 1996, websites discussing the Super Bowl, the most famous football game in the world, were being blocked by various search engines. Why would any search engine on the Internet block sites related to football? Because each Super Bowl is numbered sequentially starting in 1967 with the first game, Super Bowl I. The numbers in the name have always been in roman numerals. The 1996 Super Bowl was game number 30, so it was appropriately called Super Bowl XXX. But XXX also represents a particular type of material widely available on the Internet that many people don’t want to see in their search results. Hence, Super Bowl XXX, because of the “XXX,” was blocked from many search engines.6
6. “E-Rate and Filtering: A Review of the Children’s Internet Protection Act” (General. Energy and Commerce, Subcommittee on Telecommunications and the Internet, April 4, 2001), n.p., accessed July 8, 2014, http://www.gpo.gov/fdsys/pkg/CHRG-107hhrg72836/pdf/CHRG-107hhrg72836.pdf.
This (perhaps amusing) story is one of a long chain of such incidents in the history of attempting to censor content on the Internet—but it makes a larger point about the power of unintended consequences. In 1936, Robert K. Merton listed five sources of unintended consequences.7 Three of these directly apply to large computer network systems:
7. Robert K. Merton, “The Unanticipated Consequences of Purposive Social Action,” American Sociological Review 1, no. 6 (December 1, 1936): 894–904, accessed September 15, 2015, http://www.jstor.org/stable/2084615.
• Ignorance, making it impossible to anticipate everything, thereby leading to incomplete analysis
• Errors in analysis of the problem or following habits that worked in the past but may not apply to the current situation
• Immediate interests overriding long-term interests
How does the concept of unintended consequences apply to the complexity of a network? As a system becomes more complex, it becomes more difficult to predict the output based on any given input. As an example, a lot of research is done into the real results of various inputs into the Internet routing system—how long does it take for an update in topology or reachability to propagate throughout the Internet as a whole, how does one provider changing or implementing a specific policy impact other providers, and how do seemingly straightforward ideas, such as route flap dampening, work in the real world? The fact that research must be undertaken to discover the answers to these questions implies that the Internet, at large, is a complex system. Given a specific input, you can guess at a likely outcome, but there’s no assurance that the outcome will be what you expect.
Large complex systems with a lot of interconnected parts are difficult to analyze, leading to incomplete analysis, and hence unintended consequences through ignorance. In a large enough system, with enough components that have transparent interactions and large interaction surfaces, it’s almost impossible to know all the ways in which a single change will affect the system.
In the same way, large complex systems tend to be managed more by experience (seat of the pants flying) and rule of thumb—the cost to do a full analysis is often perceived to be much higher than the cost of a failure if the rule of thumb is wrong. This again relates to the problem of complexity through the scale and subsystem interaction. The more complex a system, the more likely errors in analysis are to creep into everyday operational models.
Note
Before pilots had instruments that could tell them the angle of the airplane (called yaw, pitch, and roll), they would “fly by the seat of their pants.” This literally means that they would judge the speed of the plane in relation to the sharpness of a turn by whether or not they slid in their seat when making the turn. If you were sliding in your seat, you weren’t flying fast enough for the turn—the centrifugal force of the speed of the plane in the turn should keep the pilot in place in the seat.
The final problem is one every network engineer knows well—it’s the two-in-the-morning phone call, the application that’s down and will cost the company millions if it’s not “up—right now,” and the shortcut taken to get things back working. We always tell ourselves we’ll look at it in the morning, or we’ll put it on the to-do list to fix sometime later, but the tyranny of the immediate takes over soon enough, and the hack stays in as part of the normal operational profile of the network. A single hack in a network of a thousand routers might not seem like it will have many negative consequences. In reality, however, a single hack can quickly bring down a thousand-router network, and a thousand hacks in a network of ten thousand routers are a disaster just waiting to happen.
Note
Another term for the concept of building in complexity simply to address a problem at hand, without considering the future impact, is technical debt.8
8. “Technical Debt,” Wikipedia, the Free Encyclopedia, September 2, 2015, accessed September 15, 2015, https://en.wikipedia.org/w/index.php?title=Technical_debt&oldid=679133748.
The power of unintended consequences teaches that to better understand, and manage, network complexity, engineers need to focus on the ability to analyze and understand the various states and interactions between the various components used to build a functioning network. The more tools and concepts you can apply to understanding the various states into which a network can fall—such as models and measurement tools—the better you will be able to deal with complexity in the real world. At the same time, there is a limit to human understanding, and therefore a limit to the number of side effects anyone can foresee.
If complexity is so—complex—then why not just design networks and protocols that are simpler? To put the question another way, why does every attempt to make anything simpler in the networking world end up apparently making things more complex in the long run? For instance, by tunneling on top of (or through) IP, the control plane’s complexity is reduced, and the network is made simpler overall. Why is it, then, that tunneled overlays end up containing so much complexity?
There are two answers to this question. The first is that human nature being what it is, engineers will always invent ten different ways to solve the same problem. This is especially true in the virtual world, where new solutions are (relatively) easy to deploy, it’s (relatively) easy to find a problem with the last set of proposed solutions, and it’s (relatively) easy to move some bits around to create a new solution that is “better than the old one.” The virtual space, in other words, is partially so messy because it’s so easy to build something new there.
The second answer, however, lies in a more fundamental problem: complexity is necessary to deal with the uncertainty involved in difficult to solve problems. Alderson and Doyle state:
In our view, however, complexity is most succinctly discussed in terms of functionality and its robustness. Specifically, we argue that complexity in highly organized systems arises primarily from design strategies intended to create robustness to uncertainty in their environments and component parts.9
9. David L. Alderson and John C. Doyle, “Contrasting Views of Complexity and Their Implications for Network-Centric Infrastructures,” IEEE Transactions on Systems, Man, and Cybernetics 40, no. 4 (July 2010): 840.
This statement can be expressed in a chart as shown in Figure 1.4.
This is counterintuitive—in fact, it’s almost the opposite of most discussions around network engineering. Engineers often assume that increasing simplicity leads to increasing robustness—but this is not true. Instead, increasing complexity increases robustness until the solution moves beyond the peak on the robustness curve. Why should this be? Because of uncertainty. As a simple example, let’s return to TLV encodings often used in network protocols. Which is better?
• Designing a protocol that can handle a large number of situations in its original format, and also support many different extensions that hadn’t been thought of when the protocol was designed.
• Designing a protocol that will support, using the minimal set of information possible, the requirements laid out at the very beginning of the design phase.
There is a strong argument to be made, in the single protocol case, for the second option—designing the protocol to support the requirements presented at the beginning of the design phase with the minimal amount of information required. There are two specific reasons; the second might appear to be the most optimal.
The on-the-wire profile of an optimally designed protocol will always be smaller than one designed with flexible additions. For any TLV, there must be a TLV header—something must describe the type and length of the value. On the other hand, any protocol that is designed to optimally carry just a specific set of information doesn’t need to carry any information about what the carried information is. To put this in other terms, the metadata, or data description, must be carried with the data if the protocol is to be flexible enough to add new data types in the future. If the protocol is “closed,” however, the metadata is part of the protocol description, and need not be carried on the wire with the data. The metadata in a flexible protocol is internalized, or carried in line with the data itself to create flexibility; it is externalized, or located in the protocol implementation, to create optimal use of bits on the wire.
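The wire-size cost of internalized metadata can be made concrete by encoding the same two fields both ways. The sketch below uses Python’s struct module for illustration; the field names, type codes, and field widths are invented for the example, not drawn from any real protocol.

```python
import struct

def encode_fixed(widgets: int, loans: int) -> bytes:
    # "Closed" protocol: the meaning and position of every field are
    # fixed by the protocol description, so only the values go on
    # the wire (two unsigned 16-bit fields, network byte order).
    return struct.pack("!HH", widgets, loans)

def encode_tlv(widgets: int, loans: int) -> bytes:
    # "Open" protocol: each value carries its own metadata on the
    # wire -- a one-octet type and a one-octet length -- ahead of
    # the 16-bit value itself.
    return (struct.pack("!BBH", 1, 2, widgets) +
            struct.pack("!BBH", 2, 2, loans))
```

The fixed encoding costs 4 octets on the wire and the TLV encoding 8, but the TLV version can grow a third field tomorrow without changing the packets already understood by deployed receivers.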
The processing profile of an optimally designed protocol will always be better than one designed with flexible additions. In the same way, a protocol designed around TLVs, or any other format that allows more types of information to be carried in the future, will require more complex processing. Offsets into the packet cannot be used to find any piece of information contained anywhere in the packet—instead, the data stream must be “walked,” to find the next TLV header, and the TLV must be processed according to a set of per TLV rules. See Figure 1.5 for an illustration of this concept.
Compare the processing required to find the octet containing the value of X in both cases. For the optimally formatted packet:
• Count off 14 octets.
• Read the value of X from the contents of the 14th octet.
For the TLV formatted packet:
• Read the first TLV header.
• This is a Y TLV; find the length and skip to the end of the TLV in the packet.
• Read the second TLV header.
• This is a Z TLV, find the length and skip to the end of the TLV in the packet.
• Read the third TLV header.
• This is an X TLV.
• Jump into the X TLV, based on the format of this particular TLV, and read the value of X.
The processing for the optimally formatted packet is much simpler; processing TLVs requires more bits to be moved into and out of memory, examined, etc. Protocols optimized to carry very specific pieces of data can have that data organized to make processing easier, and hence to reduce processor utilization. This is particularly important in the area of packet switching, where customized hardware is used to switch packets at a very high rate, and other places where hardware is used to process packets in near real time.
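The two parsing procedures compared above can be sketched in a few lines. The offset of X and the TLV type codes below are invented for the illustration (one-octet type, one-octet length, matching the step-by-step walk described earlier), not taken from any real protocol.

```python
import struct

def read_fixed(packet: bytes) -> int:
    # Optimal format: X always lives at a known offset (here, the
    # 15th octet), so a single indexed read retrieves it.
    return packet[14]

def read_tlv(packet: bytes, wanted_type: int) -> int:
    # TLV format: walk the type/length headers, skipping the value
    # of every TLV until the wanted type appears.
    offset = 0
    while offset + 2 <= len(packet):
        tlv_type, length = struct.unpack_from("!BB", packet, offset)
        offset += 2
        if tlv_type == wanted_type:
            return packet[offset]   # first value octet of the X TLV
        offset += length            # skip this TLV's value
    raise ValueError("TLV type %d not found" % wanted_type)
```

The fixed-format reader touches one octet; the TLV reader must read and branch on every header between the start of the packet and the TLV it wants, which is exactly the extra memory movement and processing the text describes.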
On the other side of this equation, however, there is the unexpected (or unpredicted). Using the TLV example allows you to see this in two ways: future extensions and error handling.
Assume you’ve designed some new protocol that is perfectly optimized both on the wire and in terms of processing requirements to transport information about the number of widgets being produced on a daily basis in each factory at a large company. Soon enough, the company sees an opportunity in offering loans for anyone who wants to buy one of these widgets, and a new requirement arises: the ability to transport information about loans for widgets across the network. In the spirit of perfectly optimized network performance, you design a new protocol to transport loan information across the network—again, the protocol is designed to minimize bandwidth utilization and processing requirements throughout the network. As the loan business expands, the company decides high finance is a good business to be in, so they decide to expand their outlets to sell not just widgets, but five or six other items, and to provide financing for each of those items as well. The question quickly becomes—should you continue designing and deploying individual protocols to manage each new requirement separately, or should you have designed a single, flexible protocol that could manage a wider range of requirements in the first place?
In the TLV format example discussed previously, TLV formatted packets require more on-the-wire bandwidth, and more power to process, but they also allow for a single protocol to be used for counting widgets, loans for widgets, loans in general, and other products in general. If a protocol is designed to manage a broad range of data types within a single set of goals, then the system actually ends up being simpler than one in which each goal is met with a separate protocol.
This example might be a little stretched, but as you get into the more practical sections of this book it will become ever more apparent how real a question this is for designers and architects on a regular basis. There is always a temptation to extend the network by simply putting a new protocol “over the top,” but without some solid ideas about goals, functional separation, and domain separation, “over the top” quickly becomes a euphemism for “spaghetti on top of the plate.” Another example of this phenomenon is putting a protocol designed for one purpose into a completely different role. For instance, the Resource Reservation Protocol (RSVP) was originally designed to reserve queue and processing space along a path for a particular flow of packets. The most common use for RSVP today is the signaling of traffic engineering paths through a network—a purpose far outside the original design of the protocol.
Another source of uncertainty in the real world is errors; for whatever reason, things don’t always go as planned. How a protocol or network reacts to these unexpected events is a crucial consideration, especially as networks become a “normal” part of life, relied on for everything from gaming to financial transactions to medical procedures.
An example of added complexity for added robustness in the area of handling network errors is the error detection or correction code found in many protocols. Figure 1.6 illustrates a simple parity-based scheme.
The parity bit is a simple example of an error detection code. When the packet is built, the total number of binary 1s is counted to determine if there is an even or odd number of 1s. If the number of 1s is odd, then the parity bit is set to 1 to make the number of 1s even. If a packet is received where the total number of 1s, including the parity bit, is odd, the receiver knows that the packet must have been corrupted during transmission in some way. This doesn’t tell the receiver what the correct information is, but it does let the receiver know it needs to ask for another copy of the data.
Adding a parity bit increases the complexity of packet processing. Each packet must be stored in a buffer someplace while the number of 1s in the packet is counted, ignoring the parity bit itself. If the number of 1s is odd, then the parity bit must be set before the packet is transmitted. The receiver must likewise take the extra step of counting the number of 1s in the packet before accepting the data, and the protocol must have some mechanism built in for the receiver to ask for another copy of the information. Is this added complexity worth it? It all depends on how often failures that can be detected through such a system happen in normal operation, or how catastrophic a single failure would be. As the complexity of the error detection or correction mechanism ramps up, the ability of the application to recover from malformed transmissions is also increased. The cost is additional packet processing, along with the errors potentially introduced through the addition of the error correction code itself.
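The parity computation and check described above can be sketched in a few lines of Python. This is a generic illustration, not code from any particular protocol implementation:

```python
def parity_bit(data: bytes) -> int:
    """Count the binary 1s in the packet; if the count is odd,
    return a parity bit of 1 so the total number of 1s is even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2

def parity_ok(data: bytes, bit: int) -> bool:
    """The receiver recounts the 1s, including the parity bit;
    an odd total means the packet was corrupted in transit."""
    total = sum(bin(byte).count("1") for byte in data) + bit
    return total % 2 == 0
```

Note that a single parity bit detects any odd number of flipped bits, but an even number of flips cancels out undetected; that is exactly the kind of robustness-versus-cost tradeoff the text describes, since stronger codes buy better detection at the price of more processing.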
Adding complexity, then, allows a network to handle future requirements and unexpected events more easily, as well as provides more services over a smaller set of base functions. If this is the case, why not simply build a single protocol running on a single network that can handle all the requirements potentially thrown at it, and can handle any sequence of events you can imagine? A single network running a single protocol would certainly reduce the number of moving parts network engineers need to deal with, making all our lives simpler, right?
Maybe not. At some point, any complex system becomes brittle—robust yet fragile is one phrase you can use to describe this condition. A system is robust yet fragile when it is able to react resiliently to an expected set of circumstances, but an unexpected set of circumstances will cause it to fail. To give an example from the real world—knife blades are required to have a somewhat unique combination of characteristics. They must be hard enough to hold an edge and cut, and yet flexible enough to bend slightly in use, returning to their original shape without any evidence of damage, and they must not shatter when dropped. It has taken years of research and experience to find the right metal to make a knife blade from, and there are still long and deeply technical discussions about which material is right for specific properties, under what conditions, etc.
There is one specific stage of making a knife blade of particular interest in the quest to understand complexity: the tempering process. To temper a knife blade, the blade is first heated to a very high temperature, and then allowed to cool. Repeating this process several times aligns the molecules of the steel so it forms a grain within the steel, as shown in Figure 1.7.10
10. Image taken from http://practicalmaintenance.net/wp-content/uploads/High-carbon-AISI-1095-Steel.jpg.
This graining acts just like the grain in wood, creating dimensional strength, thus making the resulting piece of steel very hard—in fact, steel can become so hard through this process that it can shatter if dropped on a hard surface, such as a concrete or tile floor. The blade, at this point, is robust yet fragile; it is able to achieve its primary design (cutting material) very well, but it doesn’t react well to unexpected events (being dropped on a hard surface, or having torsion applied). To make the knife usable, the blade must be “detempered” before it is actually used. To do this, the steel is heated (normally to a slightly lower temperature than used in hardening the steel), and then quenched in a bath of oil. This process destabilizes the graining the hardening process has created, making the steel slightly softer—but in the process, the steel also becomes much more flexible. The end result of the process is a blade that holds an edge, is hard enough to cut, and yet is flexible enough for everyday use.
The grain created through the hardening process represents the complexity put into networks and protocols in various ways, such as adding metadata into the packet format by adding TLVs, or adding more paths between the source and destination, or even automating a process that is normally handled by humans to reduce the number of mistakes made in handling changes to the network. The detempering process is needed as well, in the form of layering protocols, setting general (and often narrow) goals for different parts of the network, breaking up failure domains, etc.
Complexity, then, can be seen as a tradeoff. If you go too far down the scale in one direction, you wind up with a network that isn’t resilient because there is no redundancy, or there is a single failure domain, etc. If you go down the scale in the other direction, you wind up with a network that isn’t resilient because the protocols and systems cannot cope with a rapid change. There is no “perfect point” on this scale—just as with steel, it all depends on the goals the network engineer sets out to meet. Two more illustrations, taken from the technical world, will help to cement this concept.
Just about everyone knows this, but it bears repeating on a regular basis. Faced with just about any decision, you will have three goals: quick, cheap, and high quality. Of those three, you can choose any two—but never all three. If you choose a solution that is cheap and quick, you will almost certainly not end up with a quality solution. If you choose a solution that is high quality and cheap, it will take a long time to implement. You can visualize this three-way tradeoff in a somewhat unusual way, as shown in Figure 1.8.
In Figure 1.8, the darker shaded area is what might be called “the realm of reality.” The larger, more lightly shaded triangle is what might be called “the realm of goals.” While the goals contain all three possibilities, reality is structured so you can only fill some part of the goals given. You can choose, as on the far left, to balance all three goals equally. In the next illustration, the focus has been moved to quick, with a resulting movement away from cheap and high quality.
At the Association for Computing Machinery’s Symposium on the Principles of Distributed Computing in 2000, Eric Brewer presented a paper titled “Towards Robust Distributed Systems.”11 In this paper, Brewer noted that distributed systems don’t work well because applications demand that the information be consistent among all copies of a distributed database. Brewer argued that to get to true distributed computing, application designers must give up consistency. In 2002, Brewer’s theorem was proven correct, and it is now known as the CAP theorem. The CAP theorem briefly states that you cannot have a database that is consistent, available, and partition tolerant—you can only choose two of the three properties. To better understand the CAP theorem in its native environment, let’s look at the three terms involved in a little more detail.
11. Eric Brewer, “Towards Robust Distributed Systems,” July 19, 2000, n.p., accessed July 11, 2014, http://wisecracked/~brewer/cs262b-2004/PODC-keynote.pdf.
• A database is consistent when any user of a database will read the same information no matter when and where the read takes place. For instance, if you put an item in your shopping cart at an online retailer, every other user on that same site should see the inventory reduced by the item in your cart. If the database is not consistent, then two people can order the same item while only one shows as being in inventory. Consistency is often called atomicity; atomic operations always leave a database in a state where all users have the same view of the data after every operation.
• A database is accessible when no user is refused access to the database for read (or often write) operations. While the definition of accessibility is often variable, in general, it means that the database is never inaccessible for longer than an outside process or user that relies on the database can tolerate.
• A database tolerates partitions when it can be spread across multiple devices and processes separated by a network without impacting the overall operation of the database itself. Distributed databases are, by definition, partitioned across each of the machines that contain some part of the database—and partitioning is often required to support the performance and resilience requirements of databases in the real world.
By simply replacing the labels with consistent, accessible, and partitionable, the diagram shown in Figure 1.8 can be used to illustrate the CAP theorem.
Note
One direct application of the CAP theorem to network engineering lies in the observation that routing protocols are simply distributed real-time database systems.
What we’ve learned about complexity to this point can be summarized in a few statements:
• Complexity has a range of definitions centering around the concept of comprehensibility (including unintended consequences), surface interactions, and the relationship of the problem to the system being used to solve the problem.
• Complexity is a reaction to the twin uncertainties of the future and the real world.
• Complexity is required to build systems supporting a lot of capabilities and functions within a restricted physical or virtual space.
• Complexity is a set of impossibly opposed tradeoffs, rather than a single “thing.”
To carry this last point further, the quality/speed/cost conundrum and the CAP theorem can be looked at in pairs rather than threes, as well. In any pair of related items, there is a curve between the two items that can be described using the formula C ≤ 1/R, illustrated in Figure 1.9, as a tradeoff between cost and fragility. This curve is often referred to as the Turing curve.
Figure 1.9 C ≤ 1/R Shown as a Tradeoff between Increasing Fragility and Increasing Cost
There is a range of “sweet spots” along this curve; the job of the engineer is to choose which spot along the curve makes sense for every tradeoff—consciously or unconsciously, wisely or unwisely.
The next two chapters investigate several more aspects of complexity in greater detail, particularly in relation to computer networks. They begin by examining the components of complexity in the next chapter—the moving parts that make up a network, and how they interact. Chapter 3, “Measuring Network Complexity,” will investigate various attempts at measuring complexity in a network to examine the tools available in this space. Following those two chapters, the book will turn to more practical matters, with the aim of making network engineers conscious of the complexity tradeoffs they are handling in designing and deploying protocols, and designing and deploying networks. This exercise should lead to more wisdom in engineering choices as you journey to the center of complexity.
Working on broken networks is always an exciting business to be in—particularly if you work for a large vendor where large broken networks are brought to your door every day. To survive the constant onslaught, you eventually develop a set of simple and quick patches or changes you can always rely on to settle a network down so you can start the process of actually troubleshooting the problem. For example, one set of steps network engineers dealing with failures in a distance-vector control plane might take to stabilize the network could be:
• Look at the routing protocol topology database.
• Determine how many paths, on average, there are to any given destination in the table.
• Configure interfaces as passive (so they won’t exchange reachability information) until the average number of parallel paths is less than 4.
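The “average parallel paths” check in the steps above can be sketched as follows. The data structures here are hypothetical stand-ins for the protocol’s topology table, not any vendor’s actual API:

```python
def average_paths(topology: dict) -> float:
    """topology maps each destination prefix to the list of
    next hops (parallel paths) through which it is reachable."""
    if not topology:
        return 0.0
    return sum(len(paths) for paths in topology.values()) / len(topology)

def stable_enough(topology: dict, threshold: float = 4.0) -> bool:
    """The heuristic from the text: keep configuring interfaces as
    passive until the average number of parallel paths is less
    than 4."""
    return average_paths(topology) < threshold
```

For example, with five parallel paths to one destination and three to another, the average is 4, so the heuristic would call for making more interfaces passive before troubleshooting continues.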
Let’s take another example: assume that you’re examining a network that won’t converge, and you notice a lot of external routes in the protocol tables—say 75% or more of the routes in the table are externals, and they all seem to have a very short age. What’s the first step you’re likely to take in stabilizing the network? Find the redistribution points and replace any dynamic redistribution between protocols with redistribution of static routes.
But why should either of these techniques work as an initial step in getting a failed network up and running so further troubleshooting and design remediation steps can be taken? Because they attack all three of the major components of complexity in a large-scale network: the amount of state, the rate of state change, and the scope of interaction surfaces.
To build and manage resilient networks at scale, engineers are going to have to manage complexity. While this is easy enough to say, it’s hard to do. As with any engineering problem, the first step is to decide how to set the problem up. How can the problem be broken into a few smaller pieces, so each one can be managed separately? How do these pieces interact? Examining the problem of complexity with these three components of complexity in mind—state, speed, and surfaces—will help network designers and architects address complexity in an effective and balanced way. While there are a number of places you could begin looking at the complexity in this space, control plane convergence is a good place to begin, because it touches many of the issues, and many of the other network systems.
Network convergence is a prototypical example from which to draw the various components of network complexity. It’s an area of network engineering almost everyone who works in the networking field has encountered in some way, and the various components of complexity are fairly easy to tease out and understand as individual concepts within the realm of complexity.
In the case of Border Gateway Protocol (BGP) convergence, the major components in convergence time include:
• The Minimum Route Advertisement Interval (MRAI): This timer is designed to prevent state changes from overwhelming the system, particularly in preventing positive feedback loops from forming (see Chapter 8, “How Complex Systems Fail”).
• The amount of time it takes to process and complete the best path calculations, particularly in route servers, route reflectors, and other devices that handle a larger than normal set of BGP paths.
• The amount of time the BGP process on any particular speaker spends interacting with other processes in the device, such as the Routing Information Base (RIB) and other protocol processes.
Figure 2.1 illustrates BGP convergence.
In this network, when the link from Router F to 2001:db8:0:2::/64 fails:
• Router F sends a withdraw to Routers D, E, and B. Router B, because this is its best path to 2001:db8:0:2::/64, will examine its available paths to this destination. Through the best path process, Router B will choose the path through Router D as the new best path, and send an advertisement with an explicit withdraw toward Router A.
• Routers D and E finish processing the loss of the path to 2001:db8:0:2::/64 next. Router D sends a withdraw to Router B; Router E sends a withdraw to Router C.
• Router B, on receiving the withdraw from Router D, examines its table and determines that the best path is now along the path [C,E,F]—note that Router C is still processing the withdraw it received from Router E at this point, so Router B still believes that the path through Router C is available. Router B determines that it should send a new update with an implicit withdraw to Router A, but it must now wait for the MRAI to time out before it can.
• Router C now finishes processing the withdraw it received from Router E, and sends a withdraw toward Router B.
• Router B examines its local table, and finds that it has no path toward 2001:db8:0:2::/64. It now transmits a withdraw to Router A, finishing the convergence process.
This example shows how BGP effectively works from the shortest path to the longest when converging. The same situation occurs when BGP learns a new destination.1
1. Shivani Deshpande and Biplab Sikdar, “On the Impact of Route Processing and MRAI Timers on BGP Convergence Times,” April 27, 2012, n.p., accessed July 8, 2015, http://www.ecse.rpi.edu/homepages/sikdar/papers/gbcom04s.pdf.
The MRAI increases the time required to converge by one MRAI timer for each “cycle” of increasing or decreasing the autonomous system (AS) Path. The processing time of running best path can also have a major impact on the time required for BGP to converge, especially in cases where the BGP speaker must process a large number of routes (such as a route server or route reflector). The time to run best path is also impacted by the interaction of the BGP process with other processes running on the router, such as the RIB process. Building a solid BGP implementation is not an easy task—there are only a handful of solid, widely used BGP implementations in the world.
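A back-of-the-envelope model of the MRAI’s contribution to convergence time can make the “one MRAI per cycle” point concrete. This is an illustration, not a simulation of the protocol; the 30-second default used here is the classic eBGP value:

```python
def path_hunting_cycles(candidate_path_lengths: list) -> int:
    """Each distinct AS Path length a speaker can 'hunt' through
    after a withdraw represents one cycle of advertisement."""
    return len(set(candidate_path_lengths))

def worst_case_mrai_delay(candidate_path_lengths: list,
                          mrai_seconds: float = 30.0) -> float:
    """Each cycle can cost up to one MRAI wait before the next
    update is allowed to be sent."""
    return path_hunting_cycles(candidate_path_lengths) * mrai_seconds
```

A speaker holding alternate paths of AS Path lengths 2, 3, 3, and 4 can walk through three cycles of path hunting, for up to 90 seconds of MRAI-induced delay before it settles on a final answer, on top of any best path processing time.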
While Enhanced Interior Gateway Protocol (EIGRP) isn’t as widely used as it once was, it’s still worth looking at the EIGRP convergence process to gain a solid understanding of the way distributed control planes converge. Figure 2.2 illustrates a network used for discussing EIGRP convergence.
This process is fairly simple:
1. Router D discovers that it has lost its link to 2001:db8:0:2::/64 and examines its local table for an alternate route. Finding none, it sends a query to Router C, to discover if Router C has an alternate route. Router D places the route to 2001:db8:0:2::/64 in the active state while it awaits the response to this query.
2. Router C receives this query and examines its local table to determine if it has an alternate route (other than through Router D). Finding none, it will then send a query to Router B, and place the route to 2001:db8:0:2::/64 into the active state while it waits for the response to this query.
3. Router B receives this query and examines its local table to determine if it has an alternate route (other than through Router C). Finding none, it will then send a query to Router A, and place the route to 2001:db8:0:2::/64 into the active state while it waits for the response to this query.
4. Router A receives this query and, finding no alternate route, nor any other neighbors that it can ask about this destination, removes the route from its local routing table and sends a reply to Router B.
5. Router B receives this reply, removes the destination from its local routing table, and sends a reply to Router C.
6. Router C receives the reply from Router B, removes the destination from its local routing table, and sends a reply to Router D.
7. Router D receives the reply from Router C and removes 2001:db8:0:2::/64 from its local routing table.
This might seem like a lot of work (and the worst possible case has intentionally been chosen to illustrate the process from a speed-of-processing perspective), but each router can process the query or reply fairly rapidly, because there is little to do for each step. In fact, in typical EIGRP networks running “in the wild,” the average amount of time required to converge per query hop is around 200 milliseconds.
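The serial query/reply flow in the steps above can be sketched as a small simulation. The router names and the 200 ms per-hop figure come from the text; everything else is an illustrative simplification:

```python
def eigrp_query_chain(routers: list) -> list:
    """Given routers ordered from the failure outward (e.g. D, C,
    B, A), the query propagates to the edge of the query scope,
    then replies unwind back toward the originator; the route
    stays 'active' on the router nearest the failure for the
    entire round trip."""
    queries = [("query", r) for r in routers]
    replies = [("reply", r) for r in reversed(routers)]
    return queries + replies

def estimated_convergence_ms(query_hops: int,
                             per_hop_ms: float = 200.0) -> float:
    """Using the ~200 ms per query hop observed 'in the wild'."""
    return query_hops * per_hop_ms
```

The four-router example has three query hops (D to C, C to B, B to A), so this rough model predicts on the order of 600 ms to converge, and shows why convergence time grows linearly with query depth.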
Nonetheless, it is easy to see how the amount of state being processed could have a major impact on the time it takes to converge, and hence the stability of the network, as each router must process information about modifications in the network’s topology in a serial way. Router A cannot, for instance, fully process the information it has about the loss of connection to 2001:db8:0:2::/64 until every other router within the scope of the query has already processed this information.
While each router might only need 200 milliseconds to process the topology change, a single event with hundreds or thousands of changes will cause each router in the query path to process the change for each reachable destination separately (much like BGP). Hence a large-scale change in reachability can put a good deal of stress on the processor and storage for every device impacted by the query.
Link state protocols, such as OSPF and IS-IS, have a different set of convergence attributes; Figure 2.3 illustrates this process.
The steps illustrated are:
1. Router D discovers that 2001:db8:0:2::/64 is no longer reachable. In response to this change in the network topology, it will build a Link State Advertisement (LSA, for OSPF), or rebuild its Link State PDU (LSP, for IS-IS; an LSP is a Link State Packet, which plays a role similar to an LSA in OSPF), and advertise this new information toward Router C.
2. Router C, on receiving this new information, will simply forward a copy along to its neighbor, Router B, without processing the information. Router C will eventually process this information, but link state protocols typically flood first, and process later, to increase the speed at which the databases of all the devices participating in the control plane will be synchronized.
3. Router B, on receiving this new information, will simply forward a copy along to its neighbor, Router A, without processing the information.
4. At some point later in time (set by a timer within the link state protocol), Router D will run a local Shortest Path First (SPF) computation to determine what needs to be changed in the local routing table. The result will be Router D removing 2001:db8:0:2::/64 from its local routing table.
5. Shortly after Router D computes a new Shortest Path Tree (SPT), Router C will do likewise, adjusting its local routing table by removing 2001:db8:0:2::/64.
6. Shortly after Router C, Router B will perform the same computation, with the same results.
7. Finally, Router A will perform the same computation, with the same results.
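The flood-then-compute behavior in these steps can be illustrated with a toy SPF run: plain Dijkstra over a hypothetical chain topology modeled loosely on Figure 2.3. Once the flooded update removes the failed link from each router’s copy of the database, an independent local SPF run drops the prefix:

```python
import heapq

def spf(graph: dict, source: str) -> dict:
    """Plain Dijkstra: graph maps node -> {neighbor: cost}; returns
    the shortest-path cost from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

# Hypothetical chain A-B-C-D with the failed prefix hanging off D.
graph = {
    "A": {"B": 1},
    "B": {"A": 1, "C": 1},
    "C": {"B": 1, "D": 1},
    "D": {"C": 1, "2001:db8:0:2::/64": 1},
}
before = spf(graph, "A")                  # prefix reachable at cost 4
del graph["D"]["2001:db8:0:2::/64"]       # flooded update removes the link
after = spf(graph, "A")                   # prefix gone after the SPF rerun
```

Each router runs this computation against its own synchronized copy of the database, which is why flooding first and computing later keeps all the databases, and therefore all the SPF results, in agreement.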
Note
This is a somewhat simplified view of the processing required to converge a link state control plane; more detail will be considered in the context of complexity later in this book. Readers can also look at books such as IS-IS for IP Networks2 to understand link state processing more deeply.
2. Russ White and Alvaro Retana, IS-IS: Deployment in IP Networks, 1st edition. (Boston: Addison-Wesley, 2003).
In this case, the amount of state being carried in the packet impacts the processing time for a network topology change in:
• The amount of time it takes to transmit the link state information from router to router in the network. This would include serializing the packet onto the wire, clocking the packet off the wire, queuing the packet, etc. Any network topology update that requires more than one packet to flood in this way across the network will necessarily take longer. The more state required to describe the changes in the network topology, the larger the number of packets required to carry that information.
• The amount of time, processing power, and memory it takes to process the changes to the network topology will depend on the number of changes, or the amount of existing state. There are, of course, ways to optimize this processing (such as partial SPFs), almost to the point that the additional state can often only have a trivial effect on the time it takes to converge.
For link state protocols, there is still a connection between the sheer amount of state carried in the protocol and the time required for the routing protocol to converge when a change in the network topology occurs.
In large-scale systems, the sheer amount of state can be overwhelming—not only for the people working on the system, but also for the protocols and computer systems that must manage and process the information. Let’s look at some of the reasons why the amount of state matters.
The first of these factors is the amount of information that needs to be transferred across the network to converge.
Consider, for a moment, the amount of information contained in a single BGP update. Based on packet format and historical information, assume that a single BGP update consumes about 1500 bytes (octets) of memory. At the time of this writing, the current full table size is over half a million destinations, which works out to at least 795MB of data that must be passed around between the routers in the network. This doesn’t include TCP overhead, TLVs for formatting the data, and other transport requirements.
795MB of data might not seem like a lot in a world of 5GB presentations, but remember this is a distributed database running on a very large number of routers. How many? There are around 48,000 autonomous systems (ASes) connected to the Internet at the time of this writing.3 The number of BGP speakers in any given AS can range from ten to thousands; since there’s no real way to know the average number of speakers in a single AS, let’s use a conservative estimate and call it 10 routers per AS. With these numbers, this 795MB table is being synchronized between some 480,000 devices. Doesn’t sound so small now, does it?
3. “Team Cymru Internet Monitor—BGP Unique ASN Count,” n.p., accessed August 24, 2014, http://www.cymru.com/BGP/unique_asns.html.
This might all be impressive, but the average network isn’t the Internet. Even so, a network of 1000 routers is keeping a multi-megabyte table synchronized across those 1000 routers in near real time—in seconds or milliseconds, rather than minutes, hours, or days.
For distance-vector protocols, the amount of information carried in the updates is also a factor, but in a different way. For instance, each parallel link in a network running a distance-vector protocol represents a completely new copy of the reachability information being carried between the two devices connected by these parallel links. These additional copies of the information can sometimes become desynchronized, or encounter other problems that cause an inconsistent view of reachability. These additional parallel links also represent potential feedback loops that can cause the network to never converge. Hence, in the example above, the network engineer might shut down parallel links to stabilize a distance-vector control plane. This can remove enough additional state from the network to allow the control plane to fully converge, bringing the network back into operation while the problems that brought about the failure are investigated.
The amount of information carried in link state updates impacts the operation of a link state protocol in much the same way. Not only does it take longer for the information to be flooded through the network, it also builds a bigger database across which the SPF algorithm needs to be run to construct a consistent view of the network topology. If the additional information is in nodes through which shortest paths pass, rather than just leaves along the edge of the network, the additional information can directly impact the speed at which SPF runs.
Most network failures are not pure “state-driven” events; there is almost always some combination of state, speed, and surface involved. There are a few, however, that are almost purely state driven, such as network meltdowns caused by a link bounce in a large-scale hub-and-spoke network. Figure 2.4 provides a basic network for discussion.
This network starts out with just a few spokes or remote sites, and grows over time. As new spokes are added, the amount of state climbs; so long as the amount of state climbs slowly, the control plane has ample opportunity to adjust to the small changes in reachability. A failure of the multipoint link at Router A, however, causes the entire distributed database to be revised at once—often overwhelming the ability of the control plane to cope. Consider the following sequence of events:
• The link at Router A fails, causing all the neighbor adjacencies to fail at the same time.
• The link is recovered, causing all of the routers connected to the hub-and-spoke network to attempt building an adjacency with Router A, the hub router.
• Some number of spoke routers successfully begin to form an adjacency with Router A.
• The spoke routers that have begun to successfully form an adjacency with Router A send their complete routing tables toward the hub to complete the formation of these adjacencies. This information overwhelms the input queue at Router A, causing hello and other adjacency formation packets to be dropped.
• These dropped packets cause the adjacency formation process to abort, restarting the cycle.
One possible way to break this cycle, in which a large number of adjacencies form at once, all fail, and all are retried in a short period of time, is to slow the process down. Start by allowing only a small number of spoke routers to form an adjacency. Once one set of adjacencies is formed successfully, allow another set of spoke routers to form an adjacency. Breaking the set of spoke routers allowed to form an adjacency into small groups controls the information flow across the network, keeping it below the level the hub router can process. Note that breaking the adjacency formation down into small groups emulates the process by which the network was built in the first place—in smaller chunks, over time.
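The staged bring-up described above amounts to a simple batching loop. The function name and group size here are hypothetical, standing in for whatever mechanism (configuration, scripting, or a protocol feature) actually gates adjacency formation on the hub:

```python
def adjacency_groups(spokes, group_size=5):
    """Yield spoke routers in small groups; each group is allowed to form
    its adjacencies with the hub before the next group is admitted."""
    for i in range(0, len(spokes), group_size):
        yield spokes[i:i + group_size]

spokes = [f"spoke-{n}" for n in range(1, 23)]  # 22 spoke routers
for group in adjacency_groups(spokes):
    # In a real deployment, this is where the interfaces would be enabled
    # (or a filter lifted), waiting for the group's adjacencies to complete.
    print(group)
```

Each iteration mirrors one step of the network’s original, gradual growth: the hub never sees more new adjacency traffic than one small group can generate.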
If the routing protocol is viewed as a distributed, near real-time database, then the amount of time it takes for the database to converge is actually the amount of time the database is inconsistent. The example from EIGRP is particularly poignant: the EIGRP active timer is the amount of time you’re willing to allow your network to remain unconverged, and hence (in the specific case of EIGRP), how long you’re willing to allow packets to be dropped rather than forwarded to their final destination. BGP is similar to this, although traffic is more often routed suboptimally rather than dropped, with the expected results on jitter and delay. For a link state protocol, the amount of time the distributed database called the control plane remains inconsistent is the amount of time traffic can either be looped or dropped (depending on the type of topology change and the order of processing). State can be broken up into smaller chunks to be dealt with more efficiently, as shown in the example of the hub and spoke network failure.
The speed of change is, in most cases, actually a stronger predictor of network failure than the sheer amount of state in the system. So long as the state is relatively static, it isn’t costing “on the wire” or processing on the network nodes; the static state is mostly a cost in the forwarding (or data) plane rather than in the control plane. Let’s look at two examples of speed of state change having a major impact on network convergence.
Let’s begin with a simple question: how long does it take for the global Internet to converge? In other words, if you remove a route from some random edge peering point, from an upstream provider’s network, how long will it take for “the rest of the Internet” to discover that this destination is no longer reachable? To make the problem simpler, let’s assume that this route is removed at the edge of a tier 3 provider, rather than a tier 1 provider, so the route must propagate across four ASes to be removed “everywhere” (in reality, the hop count would be longer than this because of the long tail distribution of AS hop count, but this example will stick with four hops because it’s a round number close to the average).
The convergence time question becomes, then—“how long does a BGP route take to propagate through four autonomous systems?” The answer is actually fairly simple, based on a lot of research and lab work: given no route dampening, BGP (roughly) converges based on the formula:
Convergence = (Max AS Path Length – Min AS Path Length)*MRAI
where MRAI is the minimum route advertisement interval, or the amount of time after advertising information about a particular destination before advertising more recent information about that same destination. If the route originally had an AS Path of 4 hops, and the route is now unreachable, and the MRAI is 30 seconds, then it will take, as a rule of thumb, around 2 minutes to remove the route from the global table.
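As a quick sanity check, the rule of thumb above can be expressed directly. Treating the minimum AS path length as zero for a full withdrawal is an assumption made for this sketch:

```python
def bgp_convergence_estimate(max_as_path, min_as_path, mrai=30):
    """Rough convergence time in seconds, per the formula above:
    (Max AS Path Length - Min AS Path Length) * MRAI."""
    return (max_as_path - min_as_path) * mrai

# A withdrawn route that originally had a 4-hop AS path, with a 30-second MRAI:
print(bgp_convergence_estimate(4, 0))  # → 120 seconds, roughly 2 minutes
```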
What if the route is injected instead of being removed? This case deals with the time it takes for each BGP speaker—both internal and external, end to end in the entire global Internet—to receive the new routing information, process it, and then send it on to its peers. The MRAI impacts this process as illustrated in Figure 2.5.
In this network, assuming the worst possible ordering of BGP advertisements, when 2001:db8:0:2::/64 is first connected to Router F:
1. Router F advertises the new destination to Router E, which then advertises this new route to Router C. Router C advertises this route to Router B, which then advertises the route to Router A. At this point, Router B sets the MRAI timer for this destination.
2. Router F also advertises this new destination to Router D, which then advertises the same destination to Router B. This advertisement arrives just a few moments after the advertisement from Router C, and wins the best path calculation at Router B. Router B, however, cannot advertise this new route to Router A until the MRAI timer expires. Once the MRAI timer expires, Router B advertises this shorter path to Router A.
3. Router F also advertises 2001:db8:0:2::/64 to Router B directly. This advertisement reaches Router B just moments after the MRAI timer expires; by then, Router B has already advertised the path through Router D to Router A and has reset the MRAI timer. Router B must now wait until the MRAI timer expires (again) before it can advertise the new (and shorter) path to Router A.
This sequence of events has been observed in live networks (such as the global Internet). The MRAI, in this instance, causes the advertisement of newly reachable destinations to take minutes, rather than seconds.
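The delaying effect of the MRAI in the sequence above can be modeled as a toy timeline. The arrival times are invented for illustration; the point is that each successively better path must wait for the previous timer to expire before Router B can pass it along:

```python
def advertisement_times(arrivals, mrai=30):
    """arrivals: (time, path) tuples in arrival order at Router B.
    Returns when each path can actually be advertised onward to Router A."""
    sent, next_allowed = [], 0
    for t, path in arrivals:
        send_at = max(t, next_allowed)       # wait out any running MRAI timer
        sent.append((send_at, path))
        next_allowed = send_at + mrai        # MRAI timer restarts on each send
    return sent

# Path via C arrives first, via D moments later, and the direct path from F
# just after the first MRAI expiry:
print(advertisement_times([(0, "via C"), (2, "via D"), (31, "via F")]))
# → [(0, 'via C'), (30, 'via D'), (60, 'via F')]
```

The direct (best) path takes a full minute to cross a single hop, which is how newly reachable destinations end up taking minutes, rather than seconds, to propagate.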
Now, on the other side, how often do changes happen on the global Internet? Figure 2.6 is a chart, taken at the time of writing, from potaroo.net, a site that measures the state of the global routing table.4
4. “The BGP Instability Report,” n.p., accessed August 24, 2014, http://bgpupdates.potaroo.net/instability/bgpupd.html.
Several observations are in order:
• The vertical axis represents the rate of change in the global (default free zone) Internet routing table per second. The average seems to be between 15 and 30 changes per second, with peaks that reach as high as 50 changes per second.
• The horizontal axis represents time, and in this rendition of the data is divided into days. Examining the information available across years of measurements indicates that this pattern has persisted for many years.
From the same data, it’s apparent that the rate of convergence for any particular change in the global Internet table is measured in seconds or minutes (70 to 80 seconds would seem to be the average).
It’s difficult to call a network with 15 to 50 changes per second in its routing table converged in any meaningful sense of the word. In fact, the Internet doesn’t ever really converge—and it hasn’t in years. How, then, can the Internet control plane provide reachability information reliable enough to make the Internet itself work? Why doesn’t the Internet’s control plane “crash,” as might be expected in any normal network? The stability of the Internet’s control plane is primarily due to the relative stability of the “core,” which pushes most state changes to the edges of the network, and to the strong division between internal and external routing information in any specific AS. This latter point is considered in the section “Surfaces,” a few pages on.
Flapping links are not as common as they used to be, particularly wide area links, but they can be devastating in their impact on convergence. Figure 2.7 illustrates a network for reference.
For each flap of the link between Routers A and B, Routers B, C, D, and E receive thousands of updates. Router F, however, consistently receives three times as many updates in the same time period—one of the many downfalls of massively parallel topologies with distributed control planes. In real-world situations, Router F could fail in a way that prevents the network from ever converging. Depending on the configuration and scope of the failure domains in this network, Router F’s failure (or even its inability to keep up with this constant flow of updated topology information) could bleed into the network beyond Router F, causing a general control plane failure.
The combination of the speed of a flapping link and parallel links that multiply the speed of topology updates can be fatal to a routing protocol.
It’s not so much the speed of change that kills control planes; it’s the unpredictability of the speed at which information changes, combined with the amount of information changing in each time slice. The more random the rate of change, and the more random the amount of information changing, the harder it is to plan around the changes. Most network engineers, when designing a network, consider the way in which things should be interconnected, where services should be placed, and how to make things “simpler” for the human operator. What isn’t often considered is the stability of the control plane; it’s just a “given.”
Speed is a crucial point to consider when dealing with network complexity; faster generally means more complex.
There is one more idea that mixes with the amount of state and the speed of change, either amplifying or dampening both, and hence impacting control plane stability: the surfaces across which different components or systems are interacting. Three basic concepts are involved in understanding interaction surfaces in complex systems; Figure 2.8 illustrates the first two.
The first two basic concepts in interaction surfaces are:
• Interaction Depth. The depth of interaction can be seen as how strongly two systems or components interact. For instance, if one component relies on another component formatting data in a specific way, the two components must change in parallel whenever the way that piece of data is formatted changes. As the component that formats the data is changed, the component that needs to know how the data is formatted must also change. In the world of network architecture, this can be seen as the interaction between two different control planes, or the formatting of packets being inspected as they flow through the network. A single packet format change can cause hundreds or thousands of devices to require updates to read the new packet format correctly. Points A and C in Figure 2.8 illustrate deep interaction between the two components shown.
• Interaction Breadth. The number of places where two systems or components “touch” can be called the breadth of the interaction surface. The more places two systems or components interact, the more they will form a single, more complex system. In Figure 2.8, there are three points at which the two illustrated systems touch; one of these (point C) is wider than the other two, representing a number of interfaces located along a single task (or set of tasks).
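A deep interaction surface is easy to demonstrate with a shared data format. In this sketch (the header layout and field names are invented for the example), the sender and receiver are coupled through `HEADER_FMT`; changing the layout on one side without the other breaks both, which is exactly the parallel-change problem described above:

```python
import struct

# version, hop count, payload length -- the shared contract between the
# component that builds packets and the component that parses them.
HEADER_FMT = "!BBH"

def build_header(version, hops, payload_len):
    return struct.pack(HEADER_FMT, version, hops, payload_len)

def parse_header(raw):
    # Must agree with build_header byte-for-byte, or parsing silently
    # produces garbage -- the two sides can only change in parallel.
    return struct.unpack(HEADER_FMT, raw)

print(parse_header(build_header(1, 4, 512)))  # → (1, 4, 512)
```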
A single example can be used to illustrate both concepts: two routing protocols configured on a set of routers throughout a network.
• Each protocol configured on a single router interacts with any other protocols on that same router through shared resources, such as memory and processor. Multiple protocols installed on a single router also share a common RIB (or set of RIBs), so that the removal of a route by one protocol can cause a reaction in the second protocol—perhaps advertising a replacement route, or removing reachability to a destination that was only reachable through the (now removed) route. The first set of interactions—competition for shared resources—can generally be considered a narrowly scoped shallow interaction surface. The second set of interactions—shared and interactive reachability information through a shared RIB—can be considered a slightly broader, but still shallow, interaction surface.
• Configuring both routing protocols on every router in the network broadens the interaction surface, as there are more instances where the two protocols share processor and memory, as well as interacting through shared reachability information. This might not appear to be a large increase in complexity, but with every router in the network running both protocols, the opportunities for a single failure in either protocol to cause a large-scale outage are increased.
• Redistributing the two protocols to redistribute reachability information at one point in the network (on one router) increases the depth of interaction on that one router, increasing the complexity by some small amount.
• Redistributing the two protocols to redistribute reachability information on every router in the network increases the depth of interaction across the entire breadth of the interaction surface. This represents a large increase in the amount of complexity through the interaction surface between the two protocols.
The more the routing protocols rely on one another, or interact, the deeper the interaction surface. The more places the routing protocols interact, the broader the interaction surface.
Figure 2.9 illustrates the third basic concept involved in interaction surfaces: overlapping interactions.
In Figure 2.9, set A illustrates two components or systems that overlap, or interact, while set B illustrates three. As more components or systems interact in a single set of interfaces, the overall system becomes more complex. An example of this in network engineering is, again, the interaction between devices that send packets and devices that process packets as they pass through the network.
• If a packet is simply forwarded based on the destination address, then each of the routers along the path is actually interacting with the sending and receiving hosts, but only in a very shallow way. Hence, the interaction overlap is high, but the depth of interaction is very shallow. The breadth of interaction would depend on the number of hops through which the packet must pass to travel from the source to the destination.
• If a packet is inspected by a device that must maintain state between packets, and about return traffic, the interaction depth among the sender, receiver, and control plane is fairly high (any packet format change, or any network path change, will need to be accounted for by the stateful packet inspection device). But if there is only one place in the network where this interaction takes place, then there are only three systems intersecting.
• Each point at which stateful packet inspection is added in the network increases the number of systems interacting, and hence the overlapping interactions.
The more overlapping components or systems, the more complex the overall system is.
Complexity is necessary to provide the underlying robustness in real-world conditions—to repeat a statement by Alderson and Doyle first encountered in Chapter 1, “Defining Complexity”:
Specifically, we argue that complexity in highly organized systems arises primarily from design strategies intended to create robustness to uncertainty in their environments and component parts.5
5. David L. Alderson and John C. Doyle, “Contrasting Views of Complexity and Their Implications for Network-Centric Infrastructures,” IEEE Transactions on Systems, Man, and Cybernetics 40, no. 4 (July 2010): 840.
Engineers are left with the problem of managing complexity. There is a simple model that is ubiquitous throughout the natural world, and is widely mimicked in the engineering world. While engineers don’t often consciously apply this model, it’s actually used all the time. What is this model? Figure 2.10 illustrates the hourglass model.
As an example, consider the four-layer model used so widely in the networking world. Figure 2.11 compares the commonly used four-layer model for network protocols to the hourglass model shown in Figure 2.10.
At the bottom layer, the physical transport system, there are a wide array of protocols, from Ethernet to Satellite. At the top layer, where information is marshalled and presented to applications, there are a wide array of protocols, from HTTP to TELNET (and thousands of others besides). However, a funny thing happens when you move toward the middle of the stack: the number of protocols decreases, creating an hourglass. Why does this work to control complexity? Going back through the three components of complexity—state, speed, and surface—exposes the relationship between the hourglass and complexity.
• State is divided by the hourglass into two distinct types of state: information about the network, and information about the data being transported across the network. While the upper layers are concerned with marshalling and presenting information in a usable way, the lower layers are concerned with discovering what connectivity exists and what the properties of that connectivity actually are. The lower layers don’t need to know how to format an FTP frame, and the upper layers don’t need to know how to carry a packet over Ethernet—state is reduced at both ends of the model.
• Speed is controlled by hiding information between layers. Just as parallel copies of the same information can be a “speed multiplier,” hiding information can be a “speed reducer,” or perhaps a set of brakes. If information can be handled in one layer without involving the state of another layer, then the speed at which new state is presented to any particular layer is reduced. For instance, an error in the data presented to an FTP client doesn’t cause a change in the state of TCP, much less in the state of the Ethernet link.
• Surfaces are controlled by reducing the number of interaction points between the various components to precisely one—IP. This single interaction point can be well defined through a standard process, with changes in the one interaction point closely regulated to prevent massive rapid changes that will reflect up and down the protocol stack.
The layering of a stacked network model is, then, a direct attempt to control the complexity of the various interacting components of a network.
For more information on network models, see Chapter 4, “Models,” in the Cisco Press title, The Art of Network Architecture.
While state, speed, and surface will be used to describe complexity throughout this book, there is a fourth component engineers often need to take into account—optimization. Quite often, complexity is a tradeoff against optimization; increasing complexity increases the optimization of the network, and reducing complexity reduces the optimization of the network. An illustration of this principle can be found in examining the choice between event-driven reactions and timer-driven reactions to changes in the network. Figure 2.12 illustrates timer- and event-driven detection.
In Figure 2.12, a timeline is shown from left to right. Over time, two OSPF processes are exchanging periodic hello packets—a classic example of a timer-driven detection system. If the link between the two OSPF processes fails, the two processes will recognize the failure through a loss of three hello packets, causing the adjacency to fail and the routes learned through the lost neighbor to be removed from the local database and routing table. A second detection process is also shown through the link carrier, interface driver, routing table, and into OSPF Process 1. This is an event-driven detection chain:
• If the link fails, carrier detection on the physical interface will fail. This will cause the physical interface to signal the interface driver that the failure has occurred.
• When the interface driver is notified, it will then signal the routing subsystem, which will remove any routes reachable through the now failed interface from any impacted RIB.
• When the RIB removes the affected routes, including the connected interface route, it will signal OSPF Process 1 of the failure. This will cause the OSPF process to remove from its local tables any neighbors reachable through that interface, as well as any link state database entries learned from those neighbors.
Event-driven detection is more complex in this example, as the event must pass through multiple interfaces to reach OSPF Process 1. Each of these interfaces implies an interaction surface that must be managed; this interaction surface may, in fact, be deep, as the OSPF process may need to react differently depending on the link type, the type of failure, or other information provided from the lower layers. Event-driven detection also increases the speed of state change in the control plane; each link flap may be individually recorded in the OSPF process running over the link, and these flaps could well be transmitted throughout the control plane in the form of topology updates. The timer-based system is much simpler; the OSPF processes don’t have a lot of knowledge about the underlying network being used to transport the hello packets, and the state of the adjacency is changed only at fixed intervals, dampening any potential feedback loops, and slowing down the rate of change in the control plane (speed).
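The two detection styles can be contrasted in a few lines. The intervals and function names here are illustrative—three hellos per the text, not OSPF’s actual defaults or APIs:

```python
HELLO_INTERVAL = 10                  # seconds between hello packets
DEAD_INTERVAL = 3 * HELLO_INTERVAL   # "loss of three hellos," per the text

def timer_driven_down(last_hello_at, now):
    """Timer-driven: the adjacency fails only after the dead interval
    elapses with no hello received."""
    return (now - last_hello_at) >= DEAD_INTERVAL

def event_driven_down(carrier_up):
    """Event-driven: the adjacency follows the carrier signal immediately."""
    return not carrier_up

# Link fails at t=0, just after a hello was received:
print(timer_driven_down(last_hello_at=0, now=5))   # → False: still waiting
print(timer_driven_down(last_hello_at=0, now=30))  # → True: dead interval hit
print(event_driven_down(carrier_up=False))         # → True: instant
```

Even in this toy form the tradeoff is visible: the event-driven path reacts instantly but depends on a signaling chain from carrier to protocol, while the timer-driven path needs no knowledge of anything below the protocol itself.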
The optimization tradeoff should be clear in this example as well. The event-driven detection process will discover the failed link much faster, allowing the control plane to react to the failure—for instance, by routing around the link—very quickly. This reduces the Mean Time to Repair (MTTR), and hence increases the overall network availability. In this case, then, the more complex event-driven process increases optimization, while opting for reduced complexity also incurs a reduction in network optimization.
It isn’t always going to be the case that increasing optimization will require increasing complexity, or attempts to reduce complexity will always decrease optimization—but it is quite common. Examples of this tradeoff are scattered throughout this book.
State, speed, surface, and optimization—if you can get your thinking around these four components, you can get a solid grip on the problems involved in network complexity. Networks that never truly converge are becoming the norm rather than the exception; the traditional models are breaking down. The hourglass model provides a way forward through the complexity morass, if engineers can learn how to recognize complexity and manage it in all the various pieces of the network engineering puzzle.
Given these four fundamental aspects of complexity—state, speed, surface, and optimization—it only makes sense to measure these four points and generate a single number describing the overall complexity of a given design and deployment structure. It would be nice if there were some way to examine a proposed network design, or a proposed change to a network design, and be able to assign actual numbers to the complexity of each component so the complexity can be compared to any potential gain in performance, or the loss of complexity in one area can be compared to the gain in complexity in another. If it were only that simple.
As it turns out, the effort to measure complexity is, itself, quite complex.
Two problems rise to the surface when examining the problem of measuring and quantifying a network toward gaining an understanding of the overall system complexity. First, there is the sheer amount of information available. Given the current push toward big data analytics, and the ability to measure thousands to millions of interactions and data mining to discover important trends and artifacts, shouldn’t something the size of an average network be an easy problem? Consider some of the various points of measurement just in trying to understand the interaction between the data flowing through each point in the network and the queuing mechanisms used to handle that traffic. This might include things such as:
• The amount of data flowing through each point in the network, including the input and output queue of each forwarding device.
• The depth and state of each queue of each forwarding device in the network.
• The source, destination, and other header information of each packet forwarded through the network.
• The number of packets dropped by each forwarding device, including the reason why they were dropped (tail drop, packet error, filtered, filtering rule, etc.).
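Even this partial list multiplies out quickly. As a back-of-envelope sketch, with entirely made-up (and modest) device counts, sampling just queue state once per second generates billions of data points per day:

```python
def queue_samples_per_day(devices=500, queues_per_device=64,
                          interval_seconds=1):
    """Rough count of queue-state samples produced in one day if every
    queue on every forwarding device is measured once per interval.
    All input values are illustrative assumptions, not measurements."""
    intervals_per_day = 86_400 // interval_seconds
    return devices * queues_per_device * intervals_per_day


# Roughly 2.76 billion samples per day for queue depth alone;
# per-packet header records would dwarf this number.
total = queue_samples_per_day()
```

This is queue depth alone; recording per-packet headers or drop reasons as in the other bullets scales with traffic volume rather than device count, which is the root of the measurement-traffic problem raised next.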
Considering that the measurements themselves must pass through the network—and the measurements can easily contain more traffic than the measured traffic—the problems with measuring everything should quickly become apparent. How can you separate the measurement from the measured if the measurement is being carried on the same channel as what you are measuring? Added to this challenge are the states of each individual control plane system, and the components of those systems—things like the memory and processor utilization of each forwarding device, the state of each adjacency between each pair of devices participating in the control plane, and the flow of each reachability advertisement within the control plane. To make measuring the system complexity even more complex, the interactions between the systems must also somehow be taken into account—things like the impact of reachability information on the distribution and application of policy, any interdependencies between parallel control planes in terms of reachability information and system resources, and interactions between overlay and underlay control planes. Measuring not only the systems but also the interactions between the systems quickly becomes an intractable problem.
When measuring a system to understand its complexity level, some sort of sampling must take place. Sampling necessarily means that some information must be left out—which, in turn, means that any measurement of complexity along these lines is necessarily an abstract representation of the complexity, rather than a measure of the complexity itself.
To top all of this complexity off, there is very little agreement on the set of things to measure to create even an accurate abstract representation of the complexity of a network.
There is a second problem looming on the horizon just past this first one—a problem that’s not so obvious, and actually makes the problem of measuring network complexity intractable. Network design represents ordered (or intentional or organized—these three terms are often used interchangeably) complexity, rather than unordered complexity. While data analytics deals with unordered data well enough, ordered complexity is an entirely different problem set.
Let’s begin by examining some methods proposed to measure network complexity, and then consider ordered versus unordered complexity. Finally, several realms of complexity will be examined that will lead to practical applications.
The difficulty of the task hasn’t stopped researchers from attempting to measure network complexity. Quite the opposite—there are a number of methods that have been tried over the years. Each of these methods has contributed useful thinking to the problem space, and can actually be used to provide some insight into what network complexity looks like. Overall, though, none of these measurements will truly provide a complete view of the complexity of a network.
Let’s look at three examples of network complexity measurements to get a feel for the space.
The Network Complexity Index (commonly called the NCI) is described in “A Network Complexity Index for Networks of Networks”1 by Bailey and Grossman. The general idea is to tackle describing network complexity in two steps:
1. Stewart Bailey and Robert L. Grossman, “A Network Complexity Index for Networks of Networks” (Infoblox, 2013), n.p., https://web.archive.org/web/20131001093751/http://flowforwarding.org/docs/Bailey%20-%20Grossman%20article%20on%20network%20complexity.pdf.
• Break the network down into subnetworks. As described in the original paper:
Given a network N, we first divide the network into smaller sub-networks C[1], . . ., C[j], . . . , C[p] with the property that two nodes selected at random from the sub-network C[i] are more likely to be connected to each other than two nodes selected at random from outside the sub-network (N\C).
• Compute the complexity based on the size and number of the subnetworks. Again, as described in the original paper:
Given the sub-communities of the network N, let X[j] denote the size of the jth largest sub-community, so that the sequence X[1], . . . , X[p] is in decreasing order. In general, different communities may have the same size. We define the network complexity index B(N) of the network N as the solution of the following equation: B(N) = max{ j : X[j] ≥ j }
The equation given is a standard statistic used in evaluating the importance of scientific research known as the H-index. The H-index determines the impact of a particular piece of research by evaluating the number of citations of the work in a way that is similar to a web page search index using the number of links to a page to determine the importance or relevance of that page.
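Once the sub-community sizes are known, the second step is just this H-index computation; a minimal sketch:

```python
def nci(community_sizes):
    """Network Complexity Index: the H-index of the sub-community
    sizes, i.e. the largest j such that at least j sub-communities
    each contain at least j nodes."""
    sizes = sorted(community_sizes, reverse=True)
    b = 0
    for j, size in enumerate(sizes, start=1):
        if size >= j:
            b = j   # X[j] >= j still holds at this rank
        else:
            break   # sizes are decreasing, so no later j can qualify
    return b


# Six sub-communities of sizes 10, 8, 5, 4, 3, 2: the four largest
# each have at least 4 members, so B(N) = 4.
```

Tracked over time, this single number grows when either the number of sub-communities or their sizes grow, which is exactly the growth-tracking use described below.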
Seen this way, the NCI attempts to combine the connectivity within a network with the number of nodes within a network:
• The more subcommunities there are, the more connection points there must be between them, and hence the more complex the connection graph must be.
• The larger the subcommunities, the more nodes there are within the network; this again impacts the implied connectivity graph of the network.
The size and scope of the connectivity graph, in turn, impacts the way information flows within the network, which also relates to the complexity of the network.
The NCI does a good job of producing a single number that describes the size and shape of a network in terms of nodes and communities, in turn implying the scope and complexity of the network interconnections. This single number, computed over time, can help network managers and designers understand the growth of a network in terms other than sheer size.
From a network engineer’s perspective, there are several practical problems with using the NCI as a single measure of network complexity. First, this isn’t something you’re going to compute on a napkin while you’re eating dinner, or work out roughly in your head.2 This is a math-heavy computation that requires automated tools. Second, other than measuring the growth and interconnectedness of a topology, it’s hard to see how and where the NCI is useful in the real world. There’s no obvious way to reduce network complexity as measured by the NCI other than reducing the number and size of the subcommunities in the network.
2. In fact, an entire project called Tapestry was built around measuring the NCI by gathering configurations automatically and running them through a processor. The project can be found on GitHub at https://github.com/FlowForwarding/tapestry.
This second objection, however, leads to another shortcoming of the NCI: it doesn’t really measure the complexity network operators interact with. It’s quite common, in the real world, to find very large networks that support only a few workloads, have been heavily optimized for those workloads, and hence are not very complex from an engineer’s point of view. It’s also quite common to find small networks with a very diverse workload, which therefore cannot be optimized for a single workload. These networks are more complex than their size indicates—the NCI would likely underestimate their complexity.
So what does the NCI miss? Just those pieces of network architecture that designers deal with most of the time, such as:
• Policy, expressed through configuration, metrics, protocols, and other methods
• Resilience, expressed through the amount of redundancy, fast convergence mechanisms, and other highly complex design components
So while the NCI is useful, it doesn’t capture all the complexity of a single network in a way that can be usefully applied to real-world networks.
In a set of slides presented to the Internet Research Task Force’s Network Complexity Research Group, a group of researchers described a model for measuring and describing the complexity of enterprise routing design.3 The process of measurement is as follows:
3. Xin Sun, Sanjay G. Rao, and Geoffrey G. Xie, “Modeling Complexity of Enterprise Routing Design” (IRTF NCRG, November 5, 2012), n.p., accessed October 5, 2014, http://www.ietf.org/proceedings/85/slides/slides-85-ncrg-0.pdf.
1. Decompose the network design into individual pieces, implemented as individual configuration components (across all devices in the network).
2. Build a network of connections between these individual components.
3. Measure this network to determine the complexity of the configuration.
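A toy version of these three steps might represent each configuration component as a node and each cross-reference between components as an edge; the component names and the scoring rule here are invented for illustration, not the actual metric from the slides:

```python
def design_complexity(components, references):
    """Toy complexity score for a routing design: the number of
    configuration components plus the number of references between
    them. Denser interconnection means a higher score."""
    edges = sum(len(refs) for refs in references.values())
    return len(components) + edges


# Hypothetical components spread across devices, with their references.
components = ["acl-10", "route-map-OUT", "bgp-neighbor-A", "ospf-area-0"]
references = {
    "route-map-OUT": {"acl-10"},          # the route-map matches the ACL
    "bgp-neighbor-A": {"route-map-OUT"},  # the neighbor applies the route-map
}
```

Note how `ospf-area-0` contributes only a node, while the ACL/route-map/neighbor chain contributes both nodes and edges; the interconnected policy dominates the score, which is the intuition behind the measure.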
Figure 3.1 is taken from these slides,4 illustrating the linkages between the various configuration components.
4. Ibid.
The more items configured, and the more dense the interconnections between the configurations (particularly between boxes), the more complex the design is determined to be. By examining these factors, the process yields a single number describing the complexity of the design. The presentation and the papers written by the same authors extend this concept to determining whether the implemented design matches the design intent, provided the design intent is stated in a way amenable to the process.
The concept of measuring not just the lines of configuration, but the interconnections between the lines of configuration, is extremely powerful. There is little doubt that much of complexity comes from the interrelated configurations spread throughout a network required to implement a single policy or to make a single protocol or process work. Looking at the interactions, or network of configurations, also takes the number of lines of configuration somewhat out of the picture as a measure of complexity. Because some devices can express a policy in a few simple lines, while others require a lot of configuration to express the same policy, this is a useful outcome.
At the same time, however, measuring the interconnections between lines of configuration can fall prey to the same problems as simply counting the lines of configuration—an entire policy might be represented by a single line of configuration on one device, while requiring a number of lines of policy on another. For instance, on Cisco IOS Software, the command remove-private-as is used to remove any private AS numbers in a BGP route advertisement. This single command essentially replaces a set of configuration commands that would necessarily be interconnected, such as a filter and an application of that filter to a particular set of BGP peers. Both configurations are valid and perform the same set of actions, but they would appear to have completely different complexity levels according to the measure as it’s described. Further complicating the situation, different BGP implementations might use different sets of commands to perform the same action, making one configuration appear more complex than another, although they’re both implementing the same policy.
Another failing of the measure described above is that it’s not always obvious what pieces fit together to make a policy. For instance, a configuration removing private AS numbers on every eBGP speaker in an AS might not appear to be related within the measurement; there is no specific point at which these multiple configurations overlap or interact in a way that’s obvious, unless you know the intent of the configuration. Thus some policies might easily be missed as they consist of configurations with no obvious point at which they tie together.
Finally, it’s difficult to assess how a single configuration used to implement multiple policies would be handled in this measure of network complexity—and yet, this is one of the thorniest problems to manage from a complexity standpoint, as it is precisely one of those difficult-to-manage interaction surfaces between otherwise unrelated policy implementations. How do the various policies measured interact? On this point, modeling design complexity is silent.
As previously discussed, the NCI measures complexity based on scale and perceived subcomponents; modeling design rates complexity on the network of interconnected lines of configuration. What about measuring complexity based on the amount of work needed to keep the distributed database that represents the network topology synchronized across all the devices participating in the control plane? This is precisely what NetComplex does. As Chun, Ratnasamy, and Kohler state:
We conjecture that the complexity particular to networked systems arises from the need to ensure state is kept in sync with its distributed dependencies. The metric we develop in this paper reflects this viewpoint and we illustrate several systems for which this dependency centric approach appears to appropriately reflect system complexity.5
5. Byung-Gon Chun, Sylvia Ratnasamy, and Eddie Kohler, “NetComplex: A Complexity Metric for Networked System Designs” (5th Usenix Symposium on Networked Systems Design and Implementation NSDI 2008, April 2008), n.p., accessed October 5, 2014, http://berkeley.intel-research.net/sylvia/netcomp.pdf.
NetComplex evaluates the chain of dependent states in the network, assigns a metric to each dependency, and then calculates a single complexity measure based on these assigned metrics. Figure 3.2 illustrates dependency and complexity in NetComplex.
In this figure:
• Routers C and D depend on E to obtain a correct view of the network beyond Router E.
• Router B depends on Routers C, D, and E to obtain a correct view of the network beyond Router E.
• Router A depends on Routers B, C, D, and E to obtain a correct view of the network beyond Router E.
Hence, Router A “accumulates” the complexity of synchronization of information originating beyond E through the entire network. Through these dependencies, Router A is said to be linked to the remaining routers in the network. By examining these links, and combining them with the local state, the complexity of keeping the entire control plane synchronized can be given a single metric.
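The accumulation described above can be sketched as a walk through each router's upstream dependencies; the topology mirrors Figure 3.2, but the unit-weight scoring is an illustrative simplification, not the actual NetComplex metric:

```python
def accumulated_dependencies(router, depends_on):
    """Count every router whose state this router depends on, directly
    or transitively, for a correct view of the network."""
    seen = set()
    stack = [router]
    while stack:
        node = stack.pop()
        for upstream in depends_on.get(node, []):
            if upstream not in seen:
                seen.add(upstream)
                stack.append(upstream)
    return len(seen)


# Figure 3.2 topology: A depends on B; B on C and D; C and D on E.
depends_on = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": ["E"]}
```

Router A accumulates four dependencies (B, C, D, and E), while E, at the far edge, accumulates none; keeping A's view of the network synchronized is correspondingly the most expensive.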
By focusing on the amount of state and the way state is carried through the network, NetComplex does a good job of describing the complexity of a control plane. Based on this, NetComplex is useful for determining the additional complexity required to carry source routing information through the network, and to forward based on this source routing information. Another place where NetComplex would be useful is in putting a metric on the additional state information required to forward traffic on a per-flow, rather than per-destination/virtual-topology, basis.
NetComplex, however, is focused on the control plane within a single administrative or failure domain. There is no way, for instance, to account for the information hidden through route aggregation, nor to differentiate between topology information and reachability (such as what happens at a link state flooding domain boundary). NetComplex doesn’t work with policies, nor policy implementation; nor does it deal with traffic flows, subnetwork scale, or network density.
Three different measures of network complexity have been examined at this point: NCI, modeling design complexity, and NetComplex. Each of these attempts to measure, in some way, at least some component of the four realms of network complexity—state, speed, surface, and optimization. None of them, however, measures everything in any one of these four realms, and none of them even comes close to measuring overall network complexity. Why? The problem isn’t just the ability to measure and process all the information needed to produce a single complexity number; it’s embedded in the problem of network complexity itself.
Imagine, for a moment, a pool table with a set of balls on it. These specific balls are (at least nearly) perfect in their resilience, so they lose only infinitely small amounts of energy when they strike another object, and the bumpers on the sides of the table are designed in much the same way. There are no pockets in this table, either, so there is no place for the balls to leave the table. Now, place the balls on the table in some random distribution, and then strike one so it starts a chain reaction. The result will be a statistically random set of movements, each ball moving about the table, striking another ball or a bumper, retaining most of its energy, and then moving in a straight line in some other direction.
This particular problem is ripe for statistical regression analysis, or any other form of analysis data science can provide. The data scientist can tell you, based on a set of derived formulas, how often one ball will strike another, how long the system will take to run out of energy, what patterns will form in the randomly moving balls at what time—and many other things. Data science excels at finding patterns in seemingly random bits of data. In fact, the larger the data set, the more accurately it can be characterized, and the more accurate the predictions about the state of that data at some specific point in the future will be.
But let’s change the situation somewhat. Let’s take the same pool table, the same balls—all the same physical conditions. Only this time, someone has preplanned the position and movement of every ball such that no two balls strike one another, even though they are all in motion. In fact, the movement of every ball is identical throughout the entire time the balls are in motion.
What can data science tell us about this particular situation? Nothing.
Simple observation can tell us which ball will be where at any point in time. Simple observation might even be able to provide a formula telling us where there will be clumps of balls on the table, or near misses. But statistical analysis cannot go much beyond a few simple facts here. What’s more interesting is that statistical analysis cannot tell us what the point is in having these balls arranged just this way.
This is the problem of organized complexity.
As Warren Weaver noted in 1948:
This new method of dealing with disorganized complexity, so powerful an advance over the earlier two-variable methods, leaves a great field untouched. One is tempted to oversimplify, and say that scientific methodology went from one extreme to the other—from two variables to an astronomical number—and left untouched a great middle region. The importance of this middle region, moreover, does not depend primarily on the fact that the number of variables involved is moderate—large compared to two, but small compared to the number of atoms in a pinch of salt. The problems in this middle region, in fact, will often involve a considerable number of variables. The really important characteristic of the problems of this middle region, which science has as yet little explored or conquered, lies in the fact that these problems, as contrasted with the disorganized situations with which statistics can cope, show the essential feature of organization. In fact, one can refer to this group of problems as those of organized complexity.6
6. Warren Weaver, “Science and Complexity,” American Scientist 36 (1948): 539.
This field of organized complexity exactly describes the situation engineers face in looking at computer networks. No matter what angle a computer network is approached from, the problem is both complex and organized.
• Protocols are designed with a specific set of goals in mind, a specific mindset about how the problems approached should be solved, and a set of tradeoffs between current optimal use, future flexibility, supportability, and ease of implementation.
• Applications that run on top of a network are designed with a specific set of goals in mind.
• Control planes that provide the metadata that make a computer network work are designed with a specific set of goals in mind.
• Protocols that carry information through the network, at every level, are designed with a specific set of goals in mind.
No matter which system within a computer network is considered—from protocols to design to applications to metadata—each one was designed with a specific set of goals, a specific mindset about how to solve the problems at hand, and a specific set of tradeoffs. Some of these might be implicit, rather than explicit, but they are, nonetheless, intentional goals or targets.
A network is not just a single system that exhibits organized complexity, but a lot of different interlocking systems, each of which exhibits organized complexity, and all of which combined exhibit a set of goals as well (perhaps a more ephemeral set of goals, such as “making the business grow,” but a set of goals nonetheless).
Network complexity, then, cannot simply be measured, computed, and “solved,” in the traditional sense. Even if everything could be measured in a single network, and even if all the information gathered through such measurement could be processed in a way that made some sense, it would still not be possible to fully express the complexity of a computer network in all its myriad parts—in essence because there is no way to measure or express intent.
None of this means it is a waste of time to attempt to measure network complexity. What it does mean, however, is that the problem must be approached with a large dose of humility. Engineers need to be very careful to understand the tradeoffs being made in every part of the design, and very intentional in remembering that there are limits to accurately predicting the outcome of any particular design decision.
工程师不应“放弃”,而应尽一切可能来理解复杂性,控制复杂性,最小化复杂性,并做出明智的权衡——但现在没有、也永远不会有解决复杂性的灵丹妙药。正如第一章“定义复杂性”中所解释的,存在三组,其中只能选择两组,并且存在一些曲线,其中增加一个轴的复杂性以解决特定问题实际上会导致另一个轴出现问题。
Instead of “giving up,” engineers should do everything possible to understand the complexity, to contain it, to minimize it, and to make intelligent tradeoffs—but there isn’t, and won’t ever be a silver bullet for complexity. As explained in Chapter 1, “Defining Complexity,” there are sets of three out of which only two can be chosen, and there are curves where increasing complexity in one axis to solve a particular problem actually causes problems in another axis.
衡量和管理复杂性并不是浪费时间,除非您相信自己能够真正解决问题——因为“问题”无法“解决”。
Measuring and managing complexity is not wasting time unless you believe you can actually solve the problem—because the “problem” cannot be “solved.”
This investigation of complexity has so far concluded:
• Complexity is necessary to solve difficult problems, particularly in the area of robust design.
• Complexity beyond a certain level actually causes brittleness—robust yet fragile.
• Complexity is difficult (or perhaps impossible) to measure in any meaningful way at the systemic level.
• There are a number of classes of problems where it is impossible to resolve for more than two of three goals (such as fast, cheap, high quality).
Given this set of points, it might seem like this is the end of the road. Network engineers are reliant on something that cannot be effectively measured—and measurement is always the first step in controlling and managing a problem set. There will never, in the end, be a single number or formula that can describe network complexity. Should we simply put on our pirate hats and proclaim, “Abandon hope all ye who enter here”? Or is there some way out of this corner?
There is, in fact, a reasonable way to approach complexity in the real world. Rather than trying to find an absolute “measure of complexity,” or some algorithm that will “solve” complexity, it’s possible to construct a heuristic, or a method of looking at the problem set that will enable a path to a solution. The heuristic, in this case, is a two-part process.
First, expose the complexity tradeoffs inherent in network design. Exposing these tradeoffs will help engineers make intelligent choices about what is being gained, and what is being lost, when choosing any particular solution to a particular problem set. Not every problem can be solved equally well; any solution applied at one point will increase complexity somewhere else.
To put it in other terms, network engineers need to learn to be intentional about complexity.
The next chapter will begin looking at three specific realms of complexity—operational, design, and protocol. In each of these cases, several places where designers must make tradeoffs will be considered, to illustrate the process of bringing complexity out into the open. The closer engineers get to making intentional decisions about complexity when designing and managing networks, the more likely it is that networks will meet the real-world demands placed on them.
This chapter addresses operational complexity in two stages. The first section explores the problem space; the second considers the various solutions, how they address the complexity issues, and the tradeoffs involved in each one. While these sections are not exhaustive, they provide an overview of where to look for complexity in operations, and case studies of how to think through the various solutions available.
This section considers two larger topics, each with two more specific use cases or areas of investigation. The first is the cost of human interaction with the network as a system. The interaction between people and the network reaches beyond the simple user interface piece of the puzzle, and into the ways engineers can understand the network through a set of mental models, protocol operations, business and policy concepts, and other areas. Policy, in particular, comes to the fore in the second topic, an area rarely considered in network design: policy dispersion versus optimal traffic flow through the network.
The examples here are not an “end all, be all” description of the various sets of problems in this space, but rather an attempt to describe a minimal set of use cases that cover the space adequately.
Humans interact with networks through a number of different workflows, including design, deployment, management, and troubleshooting. While each of these workflows is intended to result in one thing—the deployment of a new service or application on the network—they all must actually be completed through interaction with a large number of devices scattered throughout the network. A number of key principles can be inferred:
1. The number of devices that need to be touched by a human to perform an outcome correlates with the operational complexity of that network.
2. This operational complexity directly translates into Operational Expenditures (OPEX).
3. Reducing the operational complexity will result in leaner, more productive, higher Return on Investment (ROI) networks.
4. The number of devices affecting operational complexity includes both devices directly “touched” and devices “referenced” (as, for example, in a policy definition).
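As a minimal illustration of principle 4, the footprint of a change can be counted as the union of the devices it touches directly and the devices it merely references. The change-record format used here is a hypothetical assumption for illustration, not anything defined in the text:

```python
def devices_involved(change: dict) -> set:
    """Union of devices a change touches directly and devices it references
    (for example, inside a policy definition). Both count toward the
    operational-complexity footprint of the change."""
    return set(change.get("touched", ())) | set(change.get("referenced", ()))

# A change that edits two edge routers but also references a route reflector
# involves three devices, not two:
change = {"touched": ["edge1", "edge2"], "referenced": ["rr1", "edge2"]}
```

Counting referenced devices matters because a mistake in a referenced definition can break devices the operator never logged into.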
It will be useful to dive deeper into a couple of key cases to understand the root cause of this operational complexity, whose symptom is an increased amount of human interaction.
For the first use case, consider an operator implementing a policy across all the edge devices in a network. The simplest way to understand this problem is to consider that the operator must touch each of the devices along the network edge. There are (at least) four major problems with such an approach to deploying this new policy manually:
• To deploy the new policy manually, the operator would need to touch each edge device in the network—potentially thousands of them. This could take thousands of hours of network engineering work, time that engineers could spend thinking about more productive things, like the next wave of new equipment or design challenges.
• Over the time required to deploy this new policy, the requirements (and hence the policy) could change. This isn’t always an obvious result, but there are real-life situations in which multiple rollouts of new policies (or other network changes) were stopped midstream to manage the “tyranny of the immediate,” and then never finished. The result is a network with a mishmash of policies deployed in a seemingly random fashion throughout the network.
• Even if the new policy is fully deployed over some period of time, the network will be in an inconsistent state during the deployment. This can make it difficult to troubleshoot network failures, lead to conflicting policies causing positive harm (such as the release of confidential information about customers), and cause many other unintended side effects.
• It’s common enough for humans to make a mistake when configuring a large number of devices over a long period of time. The amount of time between mistakes woven into the configuration of network devices that cause a network outage can be called the Mean Time Between Mistakes (MTBM). The MTBM can, just like the MTBF and MTTR, be tracked and managed—but manually configuring devices on a large scale will always result in mistakes creeping into configurations over time.
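The MTBM point can be made concrete with a small back-of-the-envelope sketch: even a tiny per-device slip rate compounds across a large edge. The 0.5% error rate used below is purely an illustrative assumption:

```python
def prob_at_least_one_mistake(num_devices: int, per_device_error_rate: float) -> float:
    """Probability that a manual rollout leaves at least one device
    misconfigured, assuming independent mistakes at a fixed per-device rate."""
    return 1.0 - (1.0 - per_device_error_rate) ** num_devices

# With a 0.5% chance of a slip on any one device, a 1,000-device edge is
# almost guaranteed to end up with at least one misconfiguration.
for n in (10, 100, 1000):
    print(n, round(prob_at_least_one_mistake(n, 0.005), 3))
```

The independence assumption is generous to the operator; in practice one misunderstanding tends to be repeated across many devices, which makes the real picture worse, not better.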
A second use case of interest is troubleshooting a network failure, which directly impacts one of the measures of network availability, the MTTR. The process of troubleshooting a large-scale system is often as much an art as it is an engineering skill, including multiple phases or areas of work:
• Problem identification, which normally involves some form of half splitting and comparison of expected (or ideal) state with actual state. Identifying the problem often consumes more than half the troubleshooting process (it’s not knowing what to strike with the hammer, it’s knowing where to strike it that matters).
• Problem remediation, which normally involves replacing or reconfiguring the device. Often this is undertaken in a way that provides a temporary, rather than permanent, fix. Note that while a temporary fix is in place, the network is subject to the same problems described above in the policy deployment scenario—if the tyranny of the immediate takes over, and the temporary fix is never replaced or verified, the network can build up a “layer of fixes,” or technical debt, that causes a large number of problems later on.
• Root cause analysis, which normally involves taking a deeper look at the problem symptoms, the temporary fix, and any further information gathered off the network to go deeper than “this is the problem.” Root cause analysis looks for when the changes that caused the problem were made, and why any change management process didn’t catch the error. The point of root cause analysis is to verify that the temporary fix is the correct solution, or to replace it with a more permanent one, and to try to ensure the problem doesn’t occur again in the future.
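The half-splitting step in problem identification can be sketched as a binary search over the ordered hops of a path, assuming reachability is monotone (every hop before the failure responds, every hop at or past it does not). The probe function here is a stand-in for whatever test the operator actually runs (a ping, a show command, and so on):

```python
def first_failing_hop(path, reachable):
    """Binary-search an ordered path for the first unreachable hop.

    Assumes monotone reachability: hops before the failure point respond,
    hops at or beyond it do not. Returns None if the far end is reachable.
    """
    if reachable(path[-1]):
        return None  # end-to-end connectivity is fine
    lo, hi = 0, len(path) - 1
    first_bad = hi
    while lo <= hi:
        mid = (lo + hi) // 2
        if reachable(path[mid]):
            lo = mid + 1       # failure lies further along the path
        else:
            first_bad = mid    # remember this hop, look for an earlier failure
            hi = mid - 1
    return first_bad
```

The payoff is the usual one for half splitting: roughly log2(n) probes instead of n, which is exactly why it dominates manual hop-by-hop checking as paths grow.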
Several points come into play when troubleshooting a network problem in regard to complexity, including:
• The number of devices the operator must touch to troubleshoot the problem. This is similar in scope and concept to the number of devices the operator must touch to deploy a policy.
• The number of places measurements can be taken in the network, and how difficult it is to take those measurements.
• The amount of information available to define what the network normally looks like, and hence answer the question, “what’s changed?”
To put these into perspective, consider a specific use case: a flat IP network in which a source node loses connectivity to a destination node. In this example, assume there are a few devices between source and destination. The engineer tasked with troubleshooting this problem could potentially access each device (say, by accessing each device hop-by-hop through the routers along the path the traffic should travel, starting with the default gateway) and issue a number of commands to understand each node’s state and determine which of the nodes in the path is causing the outage. As the network grows in scale and complexity, the path through the network becomes harder to trace, and hopping through the full set of devices may become close to impossible in real time (while the error is occurring). If the network grows beyond a single administrative domain, gathering information from each device along the path may become impossible due to access restrictions. The larger and more complex the network, the more difficult it will be to troubleshoot using manual methods.
Network designers don’t often think about the relationship between control plane complexity and optimal utilization, but there is a clear link through the concept of policy dispersion. Policy dispersion is nothing more than one of the costs involved in human interaction with the network, as outlined in the previous section—the number of devices that require configuration to implement a specific policy. A specific example will help in understanding the scope of the problem; use the network in Figure 4.1 for this example.
Assume three sets of policies need to be applied to incoming traffic:
• A quality of service policy classifying large file transfers originating from Host A so these flows are placed into a lower classification queue along the path.
• A quality of service policy classifying smaller file transfers originating from Host B so these flows are placed in a medium priority queue along the path.
• A quality of service policy classifying voice traffic originating from any host connected to any router in the “Edge” box into a high priority queue.
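These three policies could be expressed as a single classification function rather than as per-device configuration. The flow-record fields and the byte threshold separating “large” from “smaller” transfers are illustrative assumptions, not values from the text:

```python
HIGH, MEDIUM, LOW, BEST_EFFORT = "high", "medium", "low", "best-effort"
LARGE_TRANSFER_BYTES = 1_000_000  # hypothetical cutoff for a "large" transfer

def classify(flow: dict) -> str:
    """Map a flow record to a queue per the three example policies."""
    if flow.get("type") == "voice":                  # any host, any edge router
        return HIGH
    if flow.get("src") == "HostA" and flow.get("bytes", 0) >= LARGE_TRANSFER_BYTES:
        return LOW                                    # large transfers from Host A
    if flow.get("src") == "HostB" and flow.get("bytes", 0) < LARGE_TRANSFER_BYTES:
        return MEDIUM                                 # smaller transfers from Host B
    return BEST_EFFORT                                # everything else is untouched
```

Expressing the policy as one function makes the dispersion question that follows sharper: the logic is identical everywhere; what varies is only how many devices must carry a rendering of it.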
The concern here is not with how these policies might be implemented, but where they might be implemented—and the tradeoffs involved in deciding where to implement them. There are three places these policies can be implemented in the network illustrated:
• The most natural place to deploy these policies would be along the edge of the network (Routers C and D in the example), and potentially all the other devices at the same level in the hierarchy. This design has the advantage of enforcing the policy on the links between the edge devices and Routers E, F, G, and H (and the rest of the network), so traffic is handled optimally all along the path. The disadvantage of this solution is that the configuration of the edge routers becomes more complex as additional configuration is added to implement the policy—and there are (potentially) a large number of edge devices (in this case there are only ten, but in large-scale networks there could be many thousands). The number of devices on which the policy must be configured can be traded off against the uniformity of the configuration for all edge devices throughout the network, of course, but this doesn’t “solve” the complexity problem; it just moves the complexity from one place to another in the operational chain.
• The first alternative would be to implement these policies at Routers E, F, G, and H. In this case, the number of routers across which the policies must be synchronized is much smaller, but the links between the edge routers and these four routers will not be used optimally by the traffic flows generated by the hosts. The tradeoff is that there are only four routers on which the policies must be implemented and maintained—making the implementation and synchronization of policies easier.
• The second alternative would be to implement these policies at Routers M and N, so they primarily control the traffic flowing along the Ethernet link connecting the destination host to the network. This increases the length of each path that will be used suboptimally, and further decreases the number of devices on which the policies must be configured and managed.
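One way to reason about the three options is to score each placement by the two costs identified above: devices that must carry the policy configuration, and hops over which traffic travels unclassified. The device counts follow the example topology; the weights are entirely illustrative and would differ in any real network:

```python
def placement_cost(devices_configured: int, unpoliced_hops: int,
                   config_weight: float = 1.0, hop_weight: float = 2.0) -> float:
    """Toy score: policy-dispersion cost plus a penalty for suboptimal hops."""
    return config_weight * devices_configured + hop_weight * unpoliced_hops

# Device counts per option follow the example (ten edge routers; E-H; M and N).
options = {
    "edge (C, D, ...)":         placement_cost(10, 0),
    "aggregation (E, F, G, H)": placement_cost(4, 1),
    "core (M, N)":              placement_cost(2, 2),
}
```

Changing the weights flips which option “wins,” which is exactly the point: the right placement depends on how an organization prices configuration management against bandwidth, not on the topology alone.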
The answer, in this case, might appear to be simple—configure the policies along the edge of the network. In larger networks with a number of edge modules, however, the choice becomes more complicated.
If there are thousands of edge devices, and each edge device is configured with the same policies to ensure consistency of application and configuration, as well as optimal network usage throughout the network, how many devices must have synchronized configurations, and what is the cost of a mistake? If every edge device is configured with just the policy it needs, then what is the cost of maintaining a thousand different configurations, and what is the cost of a mistake?
Which specific areas of the network need to enforce optimal traffic flow, and why? In this small (and contrived) example, do the links between the edge routers and Routers E, F, G, and H really need to be policy controlled? Would it be more efficient to simply throw bandwidth at the problem of quality of service along the first layer of hierarchy so the relevant policies can be implemented on a smaller number of devices further into the network?
These questions aren’t as easy to answer as they might first appear. To see why, turn to some of the mechanisms used in network design to manage the operational complexities previously considered.
If you’re an experienced network engineer, by the time you get to this point, a voice in your head should be screaming “automate it!” The problems of human interaction with the network and policy dispersion through the network are, in fact, addressable through automated management tools—and this is certainly a good option in many cases. However, there is a historical tendency in network design and engineering to throw complexity over the cubicle wall onto network operations, implying “the network management system will manage that problem.” Moving complexity from the network design or the protocol into the network management system doesn’t really reduce complexity; it just hides it behind a department division (or cubicle) wall.
This tendency is further exacerbated when network engineers and designers consider only a part of the overall network when trying to reduce complexity, instead of taking a systemic architectural approach in which the movement of complexity from module to module is quite evident. Only by looking at the overall system can you see the full impact of removing complexity from one discrete area.
Three different mechanisms for managing operational complexity can illustrate this in more detail. Automation, because it is the favorite tool of all large-scale networks, will be considered first, followed by modularity, and finally adding protocol complexity to reduce management complexity.
Consider the example given in the previous section—deploying a set of quality of service policies. The basic choice in that situation was to either:
• Deploy the policy closer to the network edge on more devices, thus making more optimal use of the available bandwidth, but managing the policy on a larger number of devices.
• Deploy the policy closer to the core of the network, reducing the number of devices on which the policy must be managed, but (potentially) using the available bandwidth less efficiently.
Hence, the solutions available point to a tradeoff between managing complexity and optimal network utilization (a tradeoff that’s actually quite common in network design decisions). Automation appears to provide a “third way” out of this situation, allowing the policy to be placed on the edge devices without manual intervention, thus resolving the complexity tradeoff; however, it’s not that simple. There is no “silver bullet” for the complexity curve. Tradeoffs will be considered in the following sections.
Automated systems take the human out of the reaction. While this does provide consistency in the management process, it also removes the recognition of problems and issues from the management process. This removal of human interaction can result in what might be called ossification (the word used to describe the process of material becoming fossilized). The result can be a hard, but brittle, material—robust yet fragile is the term used to describe this in Chapter 1, “Defining Complexity.”
Automation is brittle because automated processes can’t respond to every possible failure mode or situation—because the human designers of such systems can’t imagine every possible failure mode or situation. The brittleness of an automated system can be offset by periodic human oversight and better design, but such interactions must be managed, increasing complexity in the process of deploying a system designed to reduce complexity. There is a clear tradeoff here that needs to be managed when considering automated solutions to management complexity problems.
An example of this situation might be seen when deploying an orchestration system to manage tunneled overlays in a data center fabric. Applications can be given the ability to bring up new tunneled paths across the fabric through the orchestration system fairly easily—over time, thousands (or tens of thousands) of such tunnels can be built without any human intervention. But what happens when an application builds enough tunnels to overwhelm the ability of any particular device’s forwarding table to hold the set of tunnels overlaid onto the topology? The best case is that a tunnel creation process fails, causing a human operator to look at the situation and try to find out what’s going on. The worst is a complete network failure due to an overloaded forwarding plane.
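A sketch of the guardrail implied by the best case above: the orchestration layer refuses a tunnel when any device on the requested path would exceed its forwarding-table capacity, turning a silent forwarding-plane overload into an explicit failure an operator can investigate. The data model and the capacity figure are assumptions for illustration:

```python
class ForwardingTableFull(Exception):
    """Raised when a device on the path cannot hold another tunnel entry."""

def create_tunnel(path, table_usage, capacity):
    """Reserve one forwarding entry per device on the path, or fail loudly.

    path: device names the tunnel traverses; table_usage: entries currently
    in use per device; capacity: maximum entries per device (assumed uniform).
    """
    # Check everything first, so a refusal leaves no partial reservation.
    for dev in path:
        if table_usage.get(dev, 0) + 1 > capacity:
            raise ForwardingTableFull(dev)
    for dev in path:
        table_usage[dev] = table_usage.get(dev, 0) + 1
```

The check-then-commit ordering is the design point: a rejected request leaves no half-built tunnel state behind for a later garbage-collection pass to miss.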
Ossification also leads to brittleness in another way—the assumption that “if it exists, it is needed by someone.” In a purely automated solution, it is often difficult to trace down, through the various system log files, why a particular tunnel (in this example) was configured. How can the operator be certain there isn’t an ever-increasing set of unused tunnels needlessly wasting network resources due to incorrect cleanup (or garbage collection) processes?
Automation systems are, in a sense, abstractions of the network state. Rather than dealing directly with configurations or the current running state of any given device, the network operator deals with a piece of software that interprets intended state into actual state. This added layer of abstraction is what gives network automation its power to cut through complex configurations on a large number of devices.
The other side of this abstraction, however, is losing touch with the real state of the network. An engineer looking at the configuration of a single device might need to trace through a number of processes and pieces of code to understand how and why a particular configuration is the way it is. This can be problematic in large-scale outages, or when time is at a premium in tracing down a network outage. Again, this can be resolved by taking great care with documentation; however, documenting the processes whereby a particular configuration is installed on a particular device is a form of complexity on its own.
该领域的第二个问题是无法单一来源,或者无法获得关于网络配置应该是什么以及为什么以这种方式配置的规范事实来源(将意图与部署联系起来)。工程师必须首先检查网络中部署的设备的实际配置,以了解实际配置和预期配置。他们可能会先查看自动化框架中的注释,然后最后查看存储在存储库中的某种形式的实际文档。随着时间的推移,存储在不同位置并在不同时间范围内进行管理的文档存在明显的差异趋势。由于没有“单一事实来源”,因此很难准确地知道任何特定配置的设计目的是什么,也很难知道它实际上应该如何部署。因为这,
A second problem in this space is the failure to single source, or to have a canonical source of truth about what the network configuration should be and why it’s configured this way (connecting intent to deployment). Engineers are bound to examine the actual configuration of the devices deployed in the network first to understand what is actual and intended. They might look at the comments in an automation framework second, and then finally into some form of actual documentation stored in a repository. There is a clear tendency for documentation stored in different locations, and managed within different time scales, to diverge over time. With no “single source of truth,” it’s hard to know precisely what any particular configuration was designed to do, nor how it should actually be deployed. Because of this, well-documented systems can actually provide a false view of the state of the network—a situation sometimes worse than having no (or minimal) documentation.
Anyone who has ever worked on large-scale system development from a software perspective can tell you the difficulties involved—tracking changes to the codebase, testing the code before it reaches production, managing the entire feature development process—each of these has its own set of complex problems to solve. When a network automation project begins as a “side project” on the whiteboard with a couple of network engineers, it’s hard to imagine the final system, years later, with version control, a full-blown smoke test system, a lab environment, and so on.
While automation can be a huge benefit for controlling the complexity of network devices, it can also be a source of endless complexity on its own. Vendors have, for years, promised to make the process of managing equipment easier for automation systems; there is a constant swing between open interfaces and revenue projection. Perhaps software defined networks and other, newer, models will provide the keys to unlocking simpler automated management, but there are definite tradeoffs in this area that the network engineer should be aware of.
Automation of network device configuration might be seen as similar to automating many other tasks in a network. Each solution has pluses and minuses, and each solution has its own set of complexities.
The three areas discussed previously—brittleness, network state, and managing the management system—are factors that need to be considered when deciding to automate any particular network configuration task, and when building automation systems.
Modularity is often used as a mechanism to break networks up into a number of failure domains, but it’s also a useful mechanism for managing operational complexity. Modularizing networks has many benefits, including the ability to constrain the scope of network structures, constrain the (adverse) effects of network changes, and allow for modules that are not only reusable but can also evolve somewhat independently. As a corollary, modularizing networks reduces the number of elements and protocols, and the amount of information, that humans have to deal with when interacting with these networks.
Return to the previous example: determining whether to distribute a set of quality of service policies to the edge of a network on a wide array of devices, or onto a smaller set of devices closer to the network core. One of the issues considered in that example was the choice of either:
Distributing the policy to every edge device, and thus distributing the policy configuration to a large number of devices where the policy isn’t needed
or
Distributing the policy only to the set of devices where the policy is needed, but then dealing with a lack of consistency throughout the network
模块化通过将设备分解为模块来帮助解决这个问题。然后,每个模块可以具有一组统一应用于该模块中所有边缘设备的策略。只要检查设备的网络工程师能够找到配置该设备的模块,一组通用或标准的配置就应该是显而易见的。
Modularity helps to resolve this problem by breaking devices up into modules. Each module can then have a set of policies that are applied uniformly across all the edge devices in the module. So long as a network engineer examining the device can locate the module in which the device is configured, a common or standard set of configurations should be readily apparent.
事实上,从这个意义上说,模块化是网络配置自动化所必需的服务类型抽象的附属物。将每个设备视为网络自动化系统中的“单一案例”,会将数千种配置的复杂性重新导入到自动化系统本身中,只有当可以分离出相对少量的“设备类别”时,网络自动化才有意义。其中可以抽象为单个配置。这就是模块化有助于实现的目标——设备不仅可以被分类为边缘或核心设备,还可以被分类为连接用户或连接到会计系统的进程的边缘设备(因为该设备位于主要由会计办公室使用的边缘模块)。
In fact, in this sense modularity is an adjunct to the abstraction of service types necessary to the automation of network configuration. Treating each device as a “single case” in a network automation system imports the complexity of thousands of configurations back into the automation system itself—network automation only makes sense if it’s possible to separate off a relatively small number of “devices classes,” each of which can be abstracted into a single configuration. This is what modularity helps to achieve—not only can a device be classified as an edge or core device, it can be classified as an edge device connecting a user or process that connects to accounting systems, for instance (because the device is located in an edge module used primarily by the accounting office).
This greater degree of precision allows more, rather than fewer, "cookie cutter" configurations to be designed and used throughout the network. The idea, from an operational complexity perspective, is to group these device roles into modules and templates, so the operator can interact with the aggregate rather than with individual devices. While an operator typically will not say "apply policy X to all edge devices," he or she might well say "apply policy X to all edge devices that provide Layer 3 Virtual Private Network (L3VPN) service and have more than Y subscribers."
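The aggregate-level operation just described can be sketched in a few lines. The sketch below is a minimal, hypothetical inventory model: the `Device` fields, the device names, and the `apply_policy` helper are all invented for illustration, and a real automation system would drive this from an actual inventory database and templating engine.

```python
from dataclasses import dataclass, field

@dataclass
class Device:
    name: str
    role: str                          # e.g., "edge" or "core"
    services: set = field(default_factory=set)
    subscribers: int = 0
    policies: list = field(default_factory=list)

def apply_policy(inventory, policy, predicate):
    """Apply a policy to every device matching the predicate; return their names."""
    targets = [d for d in inventory if predicate(d)]
    for d in targets:
        d.policies.append(policy)
    return [d.name for d in targets]

inventory = [
    Device("pe1", "edge", {"l3vpn"}, subscribers=1200),
    Device("pe2", "edge", {"l3vpn"}, subscribers=40),
    Device("agg1", "edge", {"l2vpn"}, subscribers=900),
    Device("p1", "core"),
]

# "Apply policy X to all edge devices that provide L3VPN service
#  and have more than 100 subscribers."
touched = apply_policy(
    inventory, "policy-X",
    lambda d: d.role == "edge" and "l3vpn" in d.services and d.subscribers > 100,
)
print(touched)  # ['pe1']
```

The operator expresses intent once, as a predicate over device classes, rather than touching each device's configuration individually.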
Modularity, however, is not without its complexity tradeoffs. Modularity can be taken to extremes in one of three directions:
• First, modularity might be taken to the extreme of attempting to build the perfect configuration for every possible corner case. If this path is taken, the final module size is often one. But rather than treating each device as an individual configuration, the result is a layering of policy on top of policy, and exception on top of exception, to the point where the final device configuration is not only unique, it is uniquely created through a unique set of rules. Each rule might apply to a wide array of devices, but each device represents a unique combination of those rules. This actually causes more problems than it solves, as it introduces another entire realm of interaction surfaces into the network—policy to policy, or domain to domain.
• Second, modularity might be taken to the extreme of managing down the array of possible policies and configurations to the minimum at the expense of the actual applications and services using the network. In this case, an application developer or user might be told they cannot deploy a particular service or application on the network because it would be too difficult to adjust the network policy around it. Another way of putting this would be, “this application can be deployed, but the application will run in a suboptimal way because there’s no easy way to adjust our tools and modules to support the policy requirements for optimization.” There will, of course, always be some set of applications that will run suboptimally on a network with a small set of consistent policies, but there must also always be a balance sought between the complexity of policy, the “leakiness” of the abstractions used to modularize the network, and the actual applications and services running on the network.
• Finally, modularity can be taken to the extreme of perfectly clean APIs, or perfect interaction surfaces, creating hard sided silos that cannot be broken down. This produces another form of brittleness through ossification in the network design; once the silos are set up, there is no space or way for the network to grow and resolve new business problems. The business becomes, in effect, a victim of complexity reduction in the network operations realm—perhaps a prime negative example of “throwing complexity over the cube wall.”
Return, for the moment, to the problem of human interaction in troubleshooting a network, application, or service failure. Assume the network operator needs to determine what path traffic takes between two specific hosts. One way the operator could determine this is to start at the host, determining the default gateway for that specific host. Once at the default gateway, the operator could then examine the local forwarding table to find the next hop; moving to this next hop device, the operator could trace the path of the traffic through the network, one device at a time.
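This hop-by-hop procedure can be mimicked in code. The following is a toy model, assuming per-device forwarding tables reduced to a simple prefix-to-next-hop mapping; the device names and table contents are invented for illustration.

```python
def trace_path(forwarding_tables, src_gateway, dest_prefix):
    """Walk per-device forwarding tables hop by hop, as an operator would."""
    path, current = [src_gateway], src_gateway
    while True:
        next_hop = forwarding_tables[current].get(dest_prefix)
        if next_hop is None:          # destination is directly attached here
            return path
        if next_hop in path:          # defensive: stop on a forwarding loop
            raise RuntimeError("loop detected at " + next_hop)
        path.append(next_hop)
        current = next_hop

tables = {
    "A": {"10.2.0.0/16": "B"},
    "B": {"10.2.0.0/16": "C"},
    "C": {},  # 10.2.0.0/16 is directly connected at C
}
print(trace_path(tables, "A", "10.2.0.0/16"))  # ['A', 'B', 'C']
```

Even in this toy form, the cost is visible: the operator (or script) must log in to and query every device along the path.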
Alternatively, the operator could use a traceroute from the source node only, and more quickly and efficiently identify where the path is likely broken. It is useful to analyze this case in some more depth. A single tool added to the operator toolset dramatically reduces the time it takes to detect the failure, and also massively reduces the number of devices and commands the operator needs to interact with. The traceroute application leverages specific protocol behavior [i.e., expiration of the Time-to-Live/Hop Limit, and generation of Internet Control Message Protocol (ICMP) messages], which is arguably more protocol complexity. The end result is that the number of devices touched by the operator decreases to one. This makes for a solid example of the shifts and tradeoffs in complexity: shifting operational complexity into protocol complexity dramatically improves the end result.
Consider a slightly more comprehensive case that illustrates similar results. Assume you are troubleshooting a broken MPLS Label Switched Path (LSP). An operator can use the same application, traceroute, not only to potentially identify the problem node, but also to trace the MPLS label stack along the path using ICMP Extensions for MPLS (RFC 4950).1 In this case, however, the operator could also choose another tool: Detecting MPLS Data Plane Failures (RFC 4379).2 This is a more complex approach from a protocol complexity standpoint, but one that can provide much more granular identification of the problem's cause, and can explore equal-cost multipaths more comprehensively. In other words, using this newer tool, MPLS LSP Ping/Traceroute, can result in faster and better troubleshooting: specifically, more comprehensive path coverage and reporting of the actual cause of the problem.
1. Ron Bonica et al., “ICMP Extensions for Multiprotocol Label Switching” (IETF, August 2007), accessed September 15, 2015, https://www.rfc-editor.org/rfc/rfc4950.txt.
2. K. Kompella and G. Swallow, “Detecting Multi-Protocol Label Switched (MPLS) Data Plane Failures” (IETF, February 2006), accessed September 15, 2015, https://www.rfc-editor.org/rfc/rfc4379.txt.
To finalize this use case, continue with the MPLS LSP example, and now assume that RSVP-TE (node or path) protection is deployed in the network, along with Bidirectional Forwarding Detection for rapid fault detection. These protocols will be explored more fully in Chapter 7, "Protocol Complexity," but it is also interesting to consider their operational implications. Rapid fault detection combined with automatic fault remediation (or bypass) can actually be seen as "proactive troubleshooting": an outage is prevented without any human intervention, leaving the operator more time to diagnose and fix the root cause.
As with all things in the world of complexity, however, there are a number of tradeoffs in the real world.
• The protocol becomes another “thing” which must be managed within the context of the network. To return to an earlier example, routing protocols reduce complexity by removing the requirement to manually configure full reachability information on every router in the network. On the other hand, routing protocols inject their own complexity into the network in operational terms, including policy, aggregation, operation, and other issues.
• The protocol is, in effect, a "leaky abstraction." While traceroute, for instance, can expose the path through the network from one host to another, it might not expose the path a particular application's or service's traffic will really take. For instance, voice traffic might be directed onto a different set of links than network management traffic for quality of service reasons.
It’s worth mentioning that it doesn’t take policy-based routing or traffic engineering to cause problems of the type described here. Assume, for instance, that you run a traceroute from a device to discover the source of a large delay along the path. The result looks something like this:
C:\>tracert example.com
1 <1 ms <1 ms <1 ms 192.0.2.45
2 4 ms 3 ms 11 ms 192.0.2.150
3 20 ms 4 ms 3 ms 198.51.100.36
4 * * * Request timed out.
5 * * * Request timed out.
6 7 ms 7 ms 7 ms 203.0.113.49
The two “starred” hops certainly look suspicious, don’t they? In reality, though, they could mean anything—or nothing. Three specific points are worth noting. First, not every device along a path will decrement the TTL on an IP packet. Data link layer firewalls that connect as switches rather than routers will pass traffic through without modifying the TTL, for instance. Second, tunnels may appear as a single hop or many. A device that switches a packet based on an outer tunnel header will not (most of the time) do anything with the inner header TTL, which means a large set of devices can appear to be a single device. Third, the return path is not shown in the output of a traceroute—and yet the return path has as much to do with network performance as the outbound path.
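A small simulation makes the first two points concrete. The device names below, and the flags marking which devices actually decrement the IP TTL, are assumptions purely for illustration; the point is simply that a traceroute can reveal far fewer hops than the packet actually traverses.

```python
def visible_hops(path):
    """Return the hops a traceroute would reveal: only devices that
    decrement the IP TTL (L2 firewalls and tunnel-core routers do not)."""
    return [name for name, decrements_ttl in path if decrements_ttl]

# Hypothetical physical path: a router, a data-link-layer firewall bridging
# traffic, a tunnel head-end, two tunnel-core routers switching on the outer
# header only, the tunnel tail-end, and the far-side router.
path = [
    ("r1", True),
    ("fw-bridge", False),   # L2 firewall: passes traffic, TTL untouched
    ("tun-head", True),
    ("p-core1", False),     # inner-header TTL not decremented in the tunnel
    ("p-core2", False),
    ("tun-tail", True),
    ("r2", True),
]
print(len(path), len(visible_hops(path)))  # 7 physical devices, 4 visible hops
```

Three of the seven devices are invisible to the operator; any delay or loss they introduce is attributed to some other hop.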
In all three of these cases—and many others besides—the network has been abstracted to appear simpler than it actually is. Hops that do exist are not exposed, and half of the entire round trip path is not shown at all. These abstractions leak, however, in impacting the performance of applications carrying traffic over the network. This lack of visibility into the actual operation of the network can have a major impact on finding and resolving problems, as well as in understanding the actual operation of the network.
This chapter has covered both the problem space and some solutions for the problem of operational complexity. The two earlier examples of the problem space are the cost of human interaction with the network, and policy dispersion versus optimal network usage. Both of these problems illustrate a single point—the scale of a network relates as much to the complexity of the services offered as it does to the number of devices physically interconnected by the network. A large number of devices, each serving a similar purpose, and therefore with few configuration differences to support a wide array of policies, have a completely different set of scaling issues than a moderately sized network carrying a wide array of services. This, if nothing else, is a useful takeaway from examining management complexity in a network—the problem of scale doesn’t have a single axis, but rather at least two axes.
After examining some of the problem space, this chapter then looked at some of the potential solutions to the complexity problems, including automation, modularity, and protocol complexity (or adding additional protocols). Each of these has tradeoffs, however, from brittleness to state to adding more surface interactions and systems that must be managed. In each of these cases, some complexity is being “tossed over the cubicle wall” to some other place to be managed. Automation is, of course, a requirement in the operation of truly large-scale networks—but it’s important not to use automation as a thin coat of paint over what is an otherwise complex system. Tossing complexity over the cubicle wall, to someone on a development operations team, or a nearby friendly coder, might make the network seem simpler, but the complexity required to solve real-world problems doesn’t ever go away.
While this chapter isn’t a major focus of this book, it does, hopefully, provide some thinking material for network engineers to consider, and some places for engineers to look for complexity tradeoffs. The next chapter moves from operational complexity to design complexity. Many of the same problems, solutions, and tradeoffs will appear, but expressed from a different view—and in a different domain.
Anyone who’s been in the network or operations world for a long time has a story about a network topology that was simply overwhelming. Like the time the network designer for one large company stood in front of a flip chart of over a hundred pages hung on the wall of a conference room explaining the ins and outs of his network’s physical layer topology. Or the network that was built with hundreds of T1 speed links through a single campus because the bandwidth requirements had long ago overrun the capacity of ten or fifteen T1s in parallel, but the local telco wouldn’t offer any sort of larger link.
These large-scale failures in network design are often trotted out as the worst of the worst in complex networks because network complexity is often equated with topological complexity—how many links are there, where are they placed, how many loops, etc. The more highly meshed the topology, the more complex the network (if it’s easier to find your way through your child’s room than your network topology—or if your network is meshier than your child’s room, you know you have a problem).
By equating topological complexity with design complexity, however, we limit our field of view in some rather unhelpful ways, because the topology doesn't stand alone in terms of complexity. It's quite easy to build a topology that is very dense, with a lot of nodes, and yet isn't all that complex. Think, for instance, of a typical folded single-stage spine-and-leaf topology. There are a lot of connections, a lot of links, and a lot of devices, but the topology is fairly easy to understand, and fairly easy to explain.
So why and how does a difficult to understand topology relate to network design complexity? Beyond the gut visual reaction, what else is there that ties complexity and spaghetti-like topologies together?
The one that we’ll consider in this chapter is the topology’s relationship with the control plane, as this is probably one of the more difficult surface interactions to manage. The two surfaces interacting—the layer that is transporting packets (such as the physical layer or a tunnel) and the control plane—are very complex systems on their own, each consisting of a number of subsystems, and hence presenting their own deep complexity problems. Combine two complex systems along a large, ill-defined, and oft-changing interaction surface, and you end up with more complexity than you want to deal with.
Three primary areas where the control plane interacts with the transport layer’s topology are as follows:
• The amount of control plane state.
• The rate of control plane state change.
• The scope of propagation through which any change in the topology must be carried—how far a topological change must be carried within the control plane.
It’s interesting to consider these three points in relation to the components of complexity back in Chapter 2, “Components of Complexity.”
• State: How does the topology contribute to, or take away from, the amount of state the control plane is handling?
• Speed: How does the topology contribute to, or take away from, the speed at which the control plane must react to changes, or even the rate at which changes in the topology are reported to the control plane?
• Surface: How many devices in the control plane does a change in the topology impact? How deep is the interaction between the control plane and the data plane, or the topology? How many places do control plane components interact?
While there are a lot of different points we could examine to understand these interactions in some detail, we’ll limit ourselves to four (otherwise, we could have an entire stack of books, rather than a single chapter).
The first example is control plane state versus stretch—a clear example of how the amount of information carried by the control plane, and the rate of change in the control plane, can be traded off against the efficiency of the network.
The second example will be topology versus speed of convergence. Not only will we look at the relationship between resilience and redundancy here, we’ll also look at the relationship between topology and resilience.
The third example, fast convergence, continues the theme of convergence speed, and hence network performance, as a source of complexity—and the tradeoffs surrounding convergence.
Our final example will be virtualization versus design complexity. This one won’t be so obvious to most network engineers, even after many years of practice, but there is a definite relationship between complexity and the various forms of virtualization network engineers practice.
What is network stretch? In the simplest terms possible, it is the difference between the shortest path in a network and the path traffic between two points actually takes. Figure 5.1 illustrates this concept.
Assuming that the cost of each link in this network is the same, the shortest physical path between Routers A and C will also be the shortest logical path: [A,B,C]. What happens, however, if we change the metric on the [A,B] link to 3? The shortest physical path is still [A,B,C], but the shortest logical path is now [A,D,E,C]. The differential between the shortest physical path and the shortest logical path is the distance a packet being forwarded between Routers A and C must travel—in this case, the stretch can be calculated as (4 [A,D,E,C])–(3 [A,B,C]), for a stretch of 1.
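This calculation is easy to automate. The sketch below rebuilds the Figure 5.1 scenario as an adjacency map (the graph encoding is assumed from the text's description of the figure) and computes stretch as the difference in hop counts between the logical shortest path and the physical shortest path.

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over {node: {neighbor: cost}}; returns the node list."""
    pq, seen = [(0, src, [src])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node in seen:
            continue
        seen.add(node)
        if node == dst:
            return path
        for nbr, c in graph[node].items():
            if nbr not in seen:
                heapq.heappush(pq, (cost + c, nbr, path + [nbr]))
    return None

# Figure 5.1's topology: A-B-C along the top, A-D-E-C below, every link cost 1.
physical = {
    "A": {"B": 1, "D": 1},
    "B": {"A": 1, "C": 1},
    "C": {"B": 1, "E": 1},
    "D": {"A": 1, "E": 1},
    "E": {"D": 1, "C": 1},
}
# The same topology after raising the [A,B] metric to 3.
logical = {n: dict(nbrs) for n, nbrs in physical.items()}
logical["A"]["B"] = logical["B"]["A"] = 3

phys = shortest_path(physical, "A", "C")   # ['A', 'B', 'C']
logi = shortest_path(logical, "A", "C")    # ['A', 'D', 'E', 'C']
stretch = len(logi) - len(phys)
print(phys, logi, stretch)  # stretch of 1
```

The result matches the text's arithmetic: (4 [A,D,E,C]) minus (3 [A,B,C]) gives a stretch of 1.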
Several points are worth noting about stretch:
• It’s sometimes difficult to differentiate between the physical topology and the logical topology. In this case, was the [A,B] link metric increased because the link is actually a slower link? If so, whether this is an example of stretch, or an example of simply bringing the logical topology in line with the physical topology is debatable.
• In line with this observation, it’s much easier to define policy in terms of stretch than almost any other way. Policy is any configuration that increases the stretch of a network. Using Policy Based Routing, or Traffic Engineering, to push traffic off the shortest physical path and onto a longer logical path to reduce congestion on specific links, for instance, is a policy—it increases stretch.
• Increasing stretch is not always a bad thing. Understanding the concept of stretch simply helps us understand various other concepts, and put a framework around complexity tradeoffs. The shortest path, physically speaking, isn’t always the best path.
• Stretch, in this illustration, is very simple—it impacts every destination, and every packet flowing through the network. In the real world, things aren’t so simple. Stretch is actually per source/destination pair, making it very difficult to measure on a network wide basis.
With all of this in mind, let’s look at two specific examples of the tradeoff between stretch and optimization.
Aggregation is a technique used to reduce not only the amount of information carried in the control plane, but also the rate of state change in the control plane. Aggregation is built into IP (both IPv4 and IPv6)—even a single subnet contains multiple host addresses. By connecting a single broadcast segment to a set of hosts, the IP routing protocol doesn’t need to manage Layer 2 reachability, nor individual host addresses.
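Python's standard `ipaddress` module can illustrate how much state aggregation removes. For example, eight contiguous /64 prefixes (matching the 2001:db8::/61 aggregate used in the examples that follow) collapse into a single advertisement, so the rest of the control plane carries one route instead of eight:

```python
import ipaddress

# Eight /64s behind one edge of the network...
subnets = [ipaddress.ip_network("2001:db8:0:%x::/64" % i) for i in range(8)]

# ...collapse into a single /61 aggregate.
aggregate = list(ipaddress.collapse_addresses(subnets))
print(aggregate)  # [IPv6Network('2001:db8::/61')]
```

Routers beyond the aggregation point also never see a topology change affecting only one of the component /64s, which is the rate-of-change reduction described above.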
Note
Host addresses are sometimes used to provide mobility within a larger network, particularly data center fabrics and mobile ad-hoc networks—an illustration of the state versus stretch tradeoff we won’t consider here.
Aggregation within the control plane can also cause stretch, as Figure 5.2 shows.
Most routing protocols take the metrics of an aggregate's components into account to provide routing as close to optimal as possible at an aggregation edge of this type. However, even with these countermeasures, it is still common for stretch to be introduced into the network. This example dispenses with component metrics to make the principle easier to illustrate.
Two different situations illustrate increasing stretch through route aggregation:
• Assume the [A,B] link has a cost of 2, and all the other links in this network have a cost of 1. If Routers B and C both aggregate to 2001:db8::/61, then the path through [A,C] would be preferred for everything within the aggregate. Traffic destined to 2001:db8:0:1::/64 will pass along the path [A,C,E,D] to reach its destination, even though the shortest (physical) path is [A,B,D]. The stretch for 2001:db8:0:2::/64 isn’t changed, but the stretch for 2001:db8:0:1::/64 is increased by one.
• Assume all the links in the network have a cost of 1. If Routers B and C both aggregate to 2001:db8::/61, then Router A will somehow load share traffic toward the two subnets behind Routers D and E across the two equal cost paths it has available. Given perfect load sharing, 50% of the traffic destined to 2001:db8:0:1::/64 will flow along [A,C,E,D], with a stretch of 1, and 50% of the traffic destined to 2001:db8:0:2::/64 will flow along [A,B,D,E], with a stretch of 1.
The question that should come to mind about now is this—so what? The amount of control plane state is decreased through aggregation (both the actual amount of state and the rate of change in that state), which clearly reduces complexity. There are three outcomes being traded off here:
• Implementing aggregation breaks up failure domains, improving the stability and resilience of the control plane (and hence of the entire network).
• Implementing aggregation requires designing, configuring, and maintaining the set of policies around the aggregation itself. This is additional complexity.
• Implementing aggregation in this network causes what would normally be two redundant links to become a single point of failure. If Router B is the preferred route to the aggregate 2001:db8::/61, and the link between Routers D and E fails, 2001:db8:0:1::/64 will become unreachable from Router A. Traffic destined to this prefix will be forwarded to Router C, because the aggregate covers this address space, and then dropped because Router C will have no route to the specific subnet. To remedy this, a new path must be installed providing an alternate path to Router D from Router C. Normally, this link would be installed between Routers B and C, and configured so it is "behind" the aggregation, rather than "in front of" the aggregation (aggregation must be configured toward Router A from Router C's perspective, not Router B). Solving this aggregation black hole problem increases complexity in several areas.
• Increasing stretch increases the complexity of the data plane by pushing traffic through more hops, and hence more queues, etc.
• Increasing stretch disconnects the obvious/apparent operation of the network from the way the network actually works. As policies are implemented that increase stretch, it becomes more difficult to look at the network topology and understand how traffic flows through the network. This increases the complexity and difficulty of troubleshooting various problems that might (in fact, almost certainly will) arise in operating the network.
• Increasing stretch increases the overall utilization of the network without any actual increase in the amount of traffic being carried through the network. In the example given in Figure 5.1, traffic that would normally take a two-hop path is directed along a three-hop path, which means one more link and one more router are involved in forwarding and switching the packets in the flow(s) across the network. In purely mathematical terms, increasing stretch decreases the overall efficiency of the network by increasing the number of devices and links used to forward any particular flow.
This last bullet is an interesting point—for it cuts to the heart of our next comparison of stretch versus control plane state, how traffic engineering impacts complexity.
Let’s look at a more specific example of trading state for traffic engineering. We’ll use the network in Figure 5.3 as an example.
Assume that every link in this network has a cost of 1; the shortest path between 2001:db8:0:1::/64 and 2001:db8:0:2::/64 is [A,B,F]. However, the network administrator doesn’t want traffic between these two subnets traveling along that path—for instance, perhaps the traffic being exchanged is a high bandwidth file transfer stream, and the [A,B] link is already used for a video stream. To prevent the two streams from “colliding,” causing quality of service issues, the network administrator implements some form of policy-based routing at Router A that redirects the file transfer traffic through the path [A,C,E,F].
Consider what this involves:
• A policy must now be created and installed on Routers A and F to redirect this specific traffic around the [A,B,F] path.
• The stretch for the file transfer traffic has been increased by one.
• This additional stretch means the traffic is now going to pass through four hops, rather than three, which means four input queues, four output queues, four forwarding tables, etc., must be managed, instead of three.
• The path of any given stream through the network is no longer obvious from the topology; to understand the path of a specific stream, a network engineer troubleshooting problems must examine the control plane policies to determine what will happen in each specific case.
Each of these is an added bit of complexity—in fact, both the data plane and the control plane complexity have increased through this exercise. So what have we gained? Why do this sort of traffic engineering?
Because even though the overall efficiency of the network (in terms of utilization) has decreased by redirecting some traffic along a path with some stretch, the overall utilization rate of the network has increased in terms of its ability to handle load, or rather to support a specific set of applications. To put this in other terms, the network now has a higher utilization rate even though the overall efficiency is lower. This prevents the network operator from being forced to install new links, or upgrade capacity—both good things from the operator’s point of view.
So in the case of state versus stretch, any time we are increasing stretch, we are implementing some sort of policy, and any time we are implementing a policy, we are increasing complexity in some way. On the other hand, creating stretch through policy is often necessary to either reduce state in the control plane, or to improve overall network utilization.
State versus stretch is, then, a set of tradeoffs. There is no absolutely right answer to the question, “Should I implement this policy that increases stretch?” The right answer is always going to depend on the goals and the projected consequences. Before we jump into the next section considering the tradeoff between topology design and convergence, it’s important to tie state versus stretch back to the state, speed, surface framework outlined in Chapter 2, “Components of Complexity”:
• State: The amount of information in the control plane is either decreased through aggregation, or increased through traffic engineering. In the one case, the amount of control plane complexity is decreasing, while the network complexity (in the data plane) is increasing, at least in some amount.
• Speed: Aggregation reduces the speed at which the control plane must react by hiding changes in the network. Chapter 6, “Managing Design Complexity,” looks at this in more detail. Adding more state to the control plane to engineer traffic has an effect that’s the opposite of aggregation; a single link failure can mean multiple updates in the control plane, so the control plane must react more quickly.
• Surface: Aggregation reduces the interaction surface between the control plane and the topology in some ways—by hiding some parts of the topology from the control plane—and increases the interaction surface in other ways—by implementing control plane policy that must coincide with specific topological features within the network. Traffic engineering increases the interaction surface between the control plane and topology.
The state, speed, and surface framework proves useful in diagnosing the various points where complexity has increased, where complexity has decreased, and where the tradeoffs lie.
Stretch isn’t the only place where the network topology and the control plane interact with one another. The actual layout of the topology is a (often not so obvious) point of interaction, as well. Let’s consider two specific examples of interactions between the topology and the control plane: ring topology convergence, and redundancy versus resilience.
Ring topologies, such as the one illustrated in Figure 5.4, have a fairly specific set of convergence characteristics.
If the [D, 2001:db8:0:1::/64] link fails, what is the convergence process for a distance vector protocol on this topology?
1. Router D discovers the failure.
2. Router D advertises the failure to Routers C and E.
3. Routers C and E advertise the loss of the link to 2001:db8:0:1::/64 to B and F.
4. Routers B and F advertise the loss of the link to 2001:db8:0:1::/64 to Router A.
5. The route to 2001:db8:0:1::/64 is removed from all the local routing tables, and traffic destined to this subnet is dropped.
What about with a link state protocol?
1. Router D discovers the failure.
2. Router D floods modified link state information to the rest of the routers in the network (or at least within the flooding domain).
3. Each router then sets a timer to delay running SPF.
4. As each router runs SPF, it will recalculate the best path to 2001:db8:0:1::/64 and install it.
5. As routers in the ring run SPF at different times, they will install the new path toward this destination at different times, resulting in microloops while the routing tables throughout the ring are in an inconsistent state. (Note this won’t happen with a route removal, as in this case.)
6. After some period of time, all routers have run SPF, recalculated their best paths to the destination, and installed the routes.
So no matter which type of routing protocol you’re using, there is some time lag as the information about changes to a specific destination is propagated and acted on through the ring. You essentially have your choice between dropping or looping traffic during convergence.
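The hop-by-hop propagation around the ring can be sketched as a back-of-envelope calculation. The per-hop delay below is purely an assumption for illustration, not a measured or default value.

```python
def ring_notification_hops(n_routers):
    """Worst-case number of advertisement hops before the router
    farthest from a failure hears about it in an n-node ring.
    Updates travel in both directions at once, so the farthest
    router is halfway around."""
    return n_routers // 2

# Assumed per-hop processing/advertisement delay for illustration.
PER_HOP_MS = 200

# Convergence lag grows linearly with ring size.
for n in (4, 6, 8, 16):
    print(n, "routers:", ring_notification_hops(n) * PER_HOP_MS, "ms")
```

For the six-router ring in the example, the failure at Router D reaches Router A only after three advertisement hops, which is why larger rings converge more slowly.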
Why, then, do network designers often use rings? The primary reasons are as follows:
• Rings are the least connection-rich topology that offers a secondary path (a two-connected graph, in more formal terms).
• Because rings are not connection rich, there are no scaling issues with the number of neighbors, etc., on a ring topology—no matter how large the ring grows, each router will only have two neighbors.
• Rings are good for spanning long geographic distances with a minimal set of (expensive) links.
• In terms of complexity, ring topologies add a lot less load to the network’s (routed or IP) control plane.
Triangles (any part of a larger topology with three hops) converge much faster than rings (any part of a topology with four or more hops), but triangles place a lot more load on the control plane—particularly in terms of the amount of information each node receives. Here, then, we have a clear complexity tradeoff in the interaction between topologies, control planes, and convergence speed.
Let’s go to the opposite end of the network topology spectrum and examine very densely connected topologies.
Assume that you need to build a network with six nines of reliability (99.9999% uptime), and the only links you have available have an average downtime of about 3.5 days in any given year (99% reliability). The easy solution to providing high availability with low reliability parts is to put many of them in parallel. Table 5.1 shows the relationship between increasing the number of parallel units and the reliability of the system.
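The arithmetic behind Table 5.1 is simple to sketch, assuming independent link failures:

```python
def parallel_availability(per_link, n_links):
    """The system is up unless every parallel link is down at once
    (assuming independent failures), so availability is
    1 - (probability all links are down simultaneously)."""
    return 1 - (1 - per_link) ** n_links

for n in range(1, 4):
    a = parallel_availability(0.99, n)
    print(n, "link(s):", f"{a:.6%}")
# Three 99%-reliable links in parallel reach 99.9999% - six nines.
```

Each additional parallel link multiplies the probability of total outage by the single-link failure rate, which is why availability climbs so quickly.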
Easy—just make certain that there are at least three links in parallel anyplace in the network, and you have six nines of reliability, right? Well, maybe. Let’s add in redundancy for our routers, and then try to build something out of it. Figure 5.6 illustrates a triple redundant full mesh topology.
This design certainly meets the six nines requirement from the perspective of links and routers—but how fast is the control plane going to converge? It might even work under “ideal” conditions, but when one link fails in a way that causes a rapid flap, interesting problems are going to result from this sort of design. In lab testing done on EIGRP, the protocol actually started losing convergence speed at around four parallel links (on a very simple topology with tens of thousands of routes).
Link state protocol convergence could be hampered by the amount of control plane traffic flooded through the network. For instance, if the (A,2001:db8:0:1::/64) link fails, the modified link state update will be flooded throughout the triple full mesh multiple times, unless some sort of flooding reduction mechanism is used. This is, in fact, why mesh groups exist in IS-IS and OSPF—to prevent large flooding events on full mesh networks. Beyond this, the number and location of microloops created during convergence on this topology could be large and difficult to predict.
One way to lower the complexity of this type of topology, while maintaining triple redundancy for resilience, might be to treat each of the sets of links between the routers as a single link—this would be possible using a technology like Link Aggregation Groups (LAGs), for instance. Observant readers are already going to realize that there are tradeoffs here, as well. First, the LAG has its own control plane, and makes modifications to the forwarding plane, to ensure equal use of each of the parallel circuits. These will interact with the overlaying IP routing protocol, creating another interaction surface. One example of the difficulty in managing this interaction surface is deciding how to handle the failure of a single link out of the three available between any two routers. Should the metric on the single aggregated link be modified in the routing protocol? Or should the link be used at equal cost anyway? What are the implications of each solution from a traffic flow and loading perspective? These are, as always, not easy questions to answer—while such a solution might fit a particular situation well, then, there are always tradeoffs to consider in terms of complexity and optimization of traffic flow through the network.
So it depends on a lot of factors, but the routing protocol here is probably not going to converge within the six nines requirement in all possible conditions without some serious tuning of timers, and potentially some reduction of control plane state along the way.
The tradeoff here is between redundancy and resilience—while it might seem that adding redundant links will always increase network resilience, this simply isn’t the case in the real world. Each redundant link adds a bit more control plane state, and a bit more load during convergence. Each added bit of load slows down the convergence speed of the control plane, and, in some cases, can even cause the control plane to fail to converge.
Let’s go back to our state, speed, and surface model to see how and where the topology versus speed tradeoff fits. They will be presented in a slightly different order here.
• State: As the redundancy built into the network increases, the number of links and the number of adjacencies/peering sessions must increase to discover the additional links and neighbor relationships across those links. This, in turn, increases both the amount of information carried in the control plane and the number of replications of that data throughout the network.
• Surface: As the number of devices in the network increases, the number of adjacencies or peering relationships among devices participating in the control plane must also increase. The increase in peering relationships represents an increase in interaction surfaces. Any particular piece of information must be carried through more devices.
• Speed: As the redundancy built into the network increases, the number of times any particular piece of information about the state of the network must be replicated increases. If any change in the topology requires the control plane to converge so the state of the internal databases matches the state of the real world (in topological terms), each additional piece of information included in the total state either requires convergence to take longer, or requires a larger amount of state to be distributed and calculated in the same amount of time. If we expect the network to converge in the same amount of time or faster with more information, then we are expecting the control plane to process information more quickly. Larger databases processed in the same amount of time (or faster) require more speed, or more transactions per second.
This last point on speed will operate as a segue into the next section investigating the relationship of speed and complexity—fast convergence.
When networking technology first became widespread, fast convergence meant fast enough to allow a file transfer or email to make its way through the network without too much interruption. As networks have become faster, the applications deployed on top of them have been built around the faster speeds and higher availability networks can deliver. Once acceptable convergence speeds are now considered too slow for many applications. Consider the following illustration showing the theoretical convergence times of various routing protocols as they were originally deployed, using default timers and no “tricks,” such as loop-free alternates (LFAs) or fast reroute mechanisms. Figure 5.7 illustrates routing protocol convergence speeds with default timers and “No Tricks.”
Figure 5.7 Routing Protocol Convergence Speeds with Default Timers and “No Tricks”
Let’s look at each protocol to get a sense for the convergence times shown here.
• Routing Information Protocol (RIP) uses a periodic update to inform neighbors about a change in reachability (RIP doesn’t really carry topology information, though some information about the topology can be inferred from the database). The “default” timer for this periodic update is 30 seconds; on average (statistically), any given change in reachability will wait 15 seconds before being advertised to a neighboring router. The average time to notify the entire network of a change in reachability, then, is the maximum number of hops the reachability change must travel multiplied by 15 seconds. For large networks, this could mean two or three minutes.
• OSPF is a link state protocol, so any change in topology and/or reachability is flooded through the entire network. Every router that receives an updated LSA will compute the SPT (calculated using SPF) independently, which means (theoretically) in parallel with all the other routers in the flooding domain. The original timers that impacted OSPF convergence are the LSA generation timer and the SPF calculation timer. These two timers, combined, result in a minimum convergence time for standard, unaltered OSPF of around one second. The maximum convergence time is shown here to be around 30 seconds, the maximum amount of time a neighbor failure will take to be discovered and advertised throughout the network.
• EIGRP is a distance vector protocol that holds one hop of topology information in a local database, using this information to precalculate loop-free paths where possible. If an alternate loop-free path is available, EIGRP can converge in less than 100 milliseconds (ms) after a link failure has been detected. In the absence of an LFA (called a Feasible Successor, or FS), EIGRP will propagate a query through the network; on average, this query will require about 200 ms per hop to process. The average amount of time EIGRP requires to converge is, then, 200 ms multiplied by the number of routers that must participate in the query process—a number regulated by network design and configuration. The maximum convergence time is set by the EIGRP Stuck in Active timer, which is normally around 90 seconds.
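The RIP and EIGRP averages above reduce to simple arithmetic. This sketch uses only the illustrative figures from the text (30-second RIP update interval, 200 ms per EIGRP query hop); these are averages for reasoning about convergence, not measurements.

```python
def rip_avg_convergence_s(max_hops, update_interval_s=30):
    """RIP: each hop waits, on average, half the periodic update
    interval before passing a reachability change along."""
    return max_hops * (update_interval_s / 2)

def eigrp_avg_convergence_ms(query_scope_hops, per_hop_ms=200):
    """EIGRP without a feasible successor: the query propagates
    hop by hop at roughly 200 ms per hop."""
    return query_scope_hops * per_hop_ms

print(rip_avg_convergence_s(10))      # 150.0 s across a 10-hop network
print(eigrp_avg_convergence_ms(5))    # 1000 ms for a 5-hop query scope
```

Note how RIP's average grows in units of seconds per hop while EIGRP's grows in fractions of a second, which is why bounding the query scope through design matters so much for EIGRP.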
Note
Why is SPT calculation theoretically in parallel? Because of slight differences in the rate at which one router can flood packets to each of its neighbors, and the cumulative time required to flood new information through a multihop network, there is no way to ensure all the routers in the network receive new topology information at precisely the same moment. Given this, there is also no way to ensure that every router in the network will begin calculating a new SPT at precisely the same moment. This is, ultimately, the cause of microloops during link state protocol convergence.
As you can see from the application requirements shown in Figure 5.7, these convergence times just won’t supply what is expected out of networks today. How can convergence times for these protocols be improved? The first step in the process was to move from talking to talking faster.
If we want to improve the convergence of a routing protocol, where would be the easiest place to start? Given the focus on the interaction between the various timers, the distribution of routing information, and the calculation of new forwarding tables, the most obvious place to begin would be with the timers the protocols use to determine when to advertise new pieces of information. To better understand the problem we face when shortening these timers, we need to understand why the timers are there in the first place. Let’s consider OSPF running on the network shown in Figure 5.8.
Just to illustrate the point, assume the 2001:db8:0:1::/64 link is flapping so it fails, then reconnects, every 200 or 300 ms. What happens if Router A generates a new LSA each time this link changes state? Routers B and C would be flooded with routing updates, taking up bandwidth and buffer space. To prevent this from happening, OSPF is designed with a minimal amount of time a router will wait before advertising a link state change—the LSA generation timer. If the LSA generation timer is set fairly high, rapid status changes in locally connected links will be damped—constant flaps like the one described here will not cause a constant flood of LSAs flooded through the network. On the other hand, setting the LSA generation timer too high will cause the network to converge more slowly. How do we resolve this problem?
The solution is to use a variable timer—allow Router A to advertise the new link state very quickly (or even immediately) when the 2001:db8:0:1::/64 link changes state, but then “back off,” so that Router A must wait some longer period of time before advertising the next link state change. This will allow the network to converge quickly for single changes, but dampen the effect of a large number of changes happening over a short period of time.
This is precisely what exponential backoff, a feature now included in most OSPF implementations, does. The timer is set to a very low number, and increased each time a new event occurs, until it reaches some maximum amount of time between new advertisements. As time passes with no events occurring, the timer is reduced until it eventually reaches the minimum again. Figure 5.9 illustrates the exponential backoff timer operation.
In Figure 5.9, the first failure of 2001:db8:0:1::/64 causes Router A to generate an LSA (close to) immediately, so the rest of the network is informed as quickly as possible. Router A, after sending this LSA, modifies its LSA generation timer to Timer Step 1. Router A then immediately begins “decaying” the LSA generation timer, reducing it slowly as time passes without any new link state changes. After a few moments, however, the second topology change occurs. Router A now waits until the timer expires, sends out a new LSA, and adds more time to the timer, increasing it to Timer Step 2 in the illustration. Again, Router A begins reducing the timer slowly while the 2001:db8:0:1::/64 link remains stable. A third state change, however, causes Router A to wait until the timer expires, transmit a new LSA, and, again, add more time to the timer value. At this point, the maximum timer setting has been reached—no matter how many future failures occur, the timer will never be set any higher than this value.
This type of exponential backoff can be applied to the interval between SPT calculations, as well, to provide fast reaction for small numbers of changes without overloading the network with updated information and processing requirements.
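The backoff-and-decay behavior walked through above might be sketched as follows. The initial value, multiplier, and maximum here are assumptions for illustration, not defaults from any particular implementation.

```python
class BackoffTimer:
    """Sketch of an exponential backoff/decay timer, in the spirit of
    the OSPF LSA-generation timer described in the text."""

    def __init__(self, initial_ms=50, multiplier=2, maximum_ms=5000):
        self.initial = initial_ms
        self.multiplier = multiplier
        self.maximum = maximum_ms
        self.current = initial_ms

    def on_event(self):
        """An event (such as a link flap) occurred: report how long the
        router must wait before advertising, then back the timer off."""
        wait = self.current
        self.current = min(self.current * self.multiplier, self.maximum)
        return wait

    def decay(self):
        """Called as quiet time passes: relax back toward the minimum."""
        self.current = max(self.current // self.multiplier, self.initial)

t = BackoffTimer()
waits = [t.on_event() for _ in range(5)]
print(waits)   # [50, 100, 200, 400, 800] - each flap lengthens the wait
```

A single change is advertised almost immediately, while a flapping link quickly runs the timer up toward its ceiling; quiet periods let the timer decay back down, matching the behavior in Figure 5.9.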
Note
If this looks similar to a chart for BGP dampening, that’s because it is—dampening and exponential backoff use the same principles and techniques to promote network stability while allowing for fast notification of small changes in reachability or topology.
Several points are worth considering in the relationship between talking faster and complexity.
• Only a minimal amount of state is added to the control plane—in fact, no new state is added to the control plane state carried between routers at all. A number of additional timers must be added to the control plane protocol implementation (this can be quite complex, of course, depending on the granularity at which the timers must operate—per neighbor/peer, per prefix, etc.), but this is all opaque to the “on the wire” protocol itself. There may—or may not—be more traffic on the wire to provide for faster convergence, depending on how the timers are set. In older networks, with lower transmission speeds, these additional packets could have a large impact on the operation of the network itself. On more modern, higher speed links, this additional traffic is minimal against the gains in convergence speed. These sorts of timer modifications, done correctly, might increase the speed of operations in the control plane, the speed at which the control plane is propagating state throughout the network. Increasing the spacing of advertising events as the rate of events increases (exponential backoff) mitigates the speed of operations to the point that little additional complexity is added if the two techniques are combined.
• Perhaps the most complexity is added through exponential backoff timers along the interaction surfaces in the control plane. Rather than having a lot of devices with fairly consistent timers, the operator now has a lot of devices, each with fairly independent timers that could be running at different rates. This makes it harder to predict, at any given moment, precisely what the state of the network is, introducing a more “quantum state” nature to the mix when troubleshooting or determining what control plane behavior will be in specific situations. Multiple failure situations will now produce a different chain of events than single failure situations, potentially causing difficult-to-trace race conditions and other artifacts of complex surface interactions.
Note
As with all control plane modifications, there is a definite tradeoff to be considered when configuring exponential backoff as a mechanism to provide faster convergence. For instance, a single misbehaving device (or an attacker) can, by injecting false changes to the topology into the control plane, conceivably force the exponential backoff timers to remain high (near the maximum timer setting in Figure 5.9). If these timers are forced to remain higher than they should be, the network will react too slowly to real changes, potentially causing application failures. This might seem far-fetched, but it’s also not something the typical network administrator is going to look for when troubleshooting application performance issues.
Modifying the convergence timers in routing protocols, then, typically provides a lot of gain in terms of convergence time without having a huge complexity impact.
What’s next after talking and talking faster? If making the timers more intelligent can dramatically improve convergence time, then what about simply taking the timers out of convergence altogether? This is what precomputing LFA paths does. Figure 5.10 will be used to illustrate how this works.
In this network, Router A has two paths to reach destinations behind Router C: via Router B with a total cost of 2, and via Router D with a total cost of 3. In this case, Router A will obviously choose the path through Router B with a total cost of 2. But why couldn’t Router A install the route through Router D as a backup path? It’s obvious the path is loop free.
This is, in fact, what precomputing does—discover alternate loop-free paths and install them as locally available backup paths. The key point to remember is that routers can’t “see” the network like a network engineer, with an actual end-to-end diagram of the network on which to map out alternate paths to reach a single destination (and it’s not as simple as finding the alternate paths on a map, anyway—the cost of the path makes a big difference, as we’ll see in the next section). All the router can do is examine the cost of the two paths, especially the cost being advertised by the next hop of any alternate route, to determine if the path should actually be used as an LFA.
In this case, Router A can look at Router D’s cost and determine that the path through Router D cannot be a loop, because the cost of Router D’s path is less than the cost of Router A’s path. In EIGRP terms, this is called the feasibility test; if the neighbor’s Reported Distance (the cost at Router D) is less than the local Feasible Distance (the best path at Router A), then the path is loop free, and can be used as a backup. OSPF and IS-IS calculate LFAs in much the same way, by calculating the cost to any given destination from the neighbor’s point of view to decide if the path is loop free or not.
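The feasibility test reduces to a single comparison. In this sketch, Router A's Feasible Distance of 2 comes from the Figure 5.10 discussion in the text; Router D's Reported Distance of 1 is an assumed value for illustration (any value below 2 would pass the test).

```python
def is_feasible_successor(reported_distance, feasible_distance):
    """EIGRP feasibility test: a neighbor's path is provably loop-free
    if its Reported Distance (the neighbor's own metric toward the
    destination) is strictly less than our best-path metric (the
    Feasible Distance)."""
    return reported_distance < feasible_distance

# Router A's best path (via Router B) costs 2, per the text.
feasible_distance_at_A = 2
# Assumed: Router D reports its own cost of 1 toward the destination.
reported_by_D = 1

print(is_feasible_successor(reported_by_D, feasible_distance_at_A))  # True
```

Note the comparison is strict: a neighbor whose Reported Distance equals the local Feasible Distance might be routing back through the local router, so it cannot be accepted as a loop-free backup.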
Let’s consider the precomputation of LFA paths in complexity terms.
• Like timer modification, precomputing LFA paths doesn’t add anything to the state carried through the network in the control plane. All the information required to compute these alternate paths is available in information already carried by the control plane, both with EIGRP and with link state protocols, so there is no reason to add any new information. Precomputing alternate paths does, of course, increase the internal state for any specific implementation, introducing new code paths that need to be tested.
• Precomputation has little to no impact on the speed at which the control plane operates—if anything, if most paths through a network can be protected through precomputation, the timer modifications discussed in the previous section may become less important, allowing convergence speeds to become “lazier,” and the pace of notifications in the control plane to be slower.
• Perhaps the most complexity is added through precomputation—as was true with exponential backoff timers—along the interaction surfaces in the control plane. This is true along two dimensions: control plane operation and operational state. Control plane operation is made more complex by the possibility that a single failure, or multiple failures, may cause overlapping precomputed paths to come into play, causing unanticipated states in the network. Race conditions and other side effects must be considered (and possibly tested for). From an operational state perspective, precomputed paths add one more piece of internal state on each forwarding device that operators must pay attention to—what is the alternate path, and is the network currently between switching to the alternate path and calculating a new best path, for instance. When two failures occur at or near the same time, it’s difficult for an operator to predict the outcome, and hence to know how the network will actually converge (and whether it will converge within the required bounds).
As was the case with modifying timers to improve protocol convergence, precomputing paths offers a lot of gain against a very small increase in apparent complexity. In fact, EIGRP Feasible Successors, a form of precomputed paths, have been deployed in many large-scale networks for many years with little noticeable increase in complexity.
What happens if the metrics in Figure 5.10 are changed slightly, throwing in an extra router, and resulting in the network illustrated in Figure 5.11?
Router A still has two paths to every destination beyond Router C, such as 2001:db8:0:1::/64, but the path via Router D cannot be used as a LFA. Why? If the path to 2001:db8:0:1::/64 through Router B fails, and Router A switches to the alternate path through Router D, where will Router D forward this traffic? Router D has two paths to 2001:db8:0:1::/64; through Router E with a cost of 4, and through Router A with a cost of 3. Router D will, then, forward any traffic Router A sends it with a destination of 2001:db8:0:1::/64 back to Router A.
Router A using Router D as an LFA, in this case, actually results in a routing loop during the time between the failure of the path through Router B and the time when Router D recomputes its best path and starts using the path through Router E to reach 2001:db8:0:1::/64. This is called a microloop, because the loop is caused by the control plane, and it tends to last only a very short time (it isn’t a permanent loop).
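The microloop can be seen by modeling Router D’s path selection directly. The function below is an illustrative sketch, with the costs taken from the text (via Router E at cost 4, via Router A at cost 3):

```python
# Why Router D cannot serve as an LFA here: until Router D runs SPF
# again, its best path to 2001:db8:0:1::/64 points back at Router A,
# so traffic A sends to D is simply returned, forming a microloop.

def best_next_hop(paths):
    """Pick the lowest-cost (next_hop, cost) entry, as SPF would."""
    return min(paths, key=lambda entry: entry[1])[0]

router_d_paths = [("Router E", 4), ("Router A", 3)]

# Before D reconverges, its forwarding entry points back at A ...
assert best_next_hop(router_d_paths) == "Router A"

# ... and only once D removes the failed path through A does the
# path through E take over, ending the microloop.
reconverged = [p for p in router_d_paths if p[0] != "Router A"]
assert best_next_hop(reconverged) == "Router E"
```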
How can we resolve this? The most obvious answer is to simply not use the path through Router D as a backup path—but we’re trying to provide for fast rerouting of traffic in the case of a change in the network’s topology, not dropped packets. Fast reroute solutions can solve this problem by tunneling past Router D, to some point in the network where the traffic won’t be forwarded (looped) back toward Router A. There are several ways to accomplish this:
• Compute the SPT from the perspective of a wider ring of neighbors until you find a point several hops away where you can tunnel traffic to without it looping back to yourself. Most mechanisms that do this stop searching at two hops (your neighbor’s neighbor), because this finds the vast majority of all alternate paths available in this way.
• Advertise reachability to 2001:db8:0:1::/64 through Routers D and E to Router A.1
1. See, as an example, S. Bryant, S. Previdi, and M. Shand, “A Framework for IP and MPLS Fast Reroute Using Not-Via Addresses” (IETF, August 2013), accessed September 15, 2015, https://www.rfc-editor.org/rfc/rfc6981.txt.
• Compute an alternate topology using an algorithm such as depth first searching with a random walk, or others.2
2. See, as an example, A. Atlas et al., “An Architecture for IP/LDP Fast-Reroute Using Maximally Redundant Trees” (IETF, July 2015), accessed September 15, 2015, https://www.ietf.org/id/draft-ietf-rtgwg-mrt-frr-architecture-06.txt.
• Compute a reverse tree from the destination toward the local router; find a point on this reverse tree that is not on the current best path from the local router toward the destination.3
3. For further information, see Chapter 8, “Weathering Storms,” in The Art of Network Architecture: Business-Driven Design, 1st edition. (Indianapolis, Indiana: Cisco Press, 2014).
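The first approach in this list can be sketched as a search over the routers within two hops of the local router, running an SPF computation from each candidate’s point of view. The topology, costs, and function names below are invented for illustration:

```python
import heapq

# Search the routers within two hops of the local router for one
# whose own best path to the destination does not pass back through
# the local router. Traffic tunneled to that point cannot loop.

def spf(graph, source):
    """Dijkstra's SPF: return {node: (cost, path)} from source."""
    dist = {source: (0, [source])}
    queue = [(0, source, [source])]
    while queue:
        cost, node, path = heapq.heappop(queue)
        if cost > dist[node][0]:
            continue
        for neighbor, link_cost in graph[node].items():
            new_cost = cost + link_cost
            if neighbor not in dist or new_cost < dist[neighbor][0]:
                dist[neighbor] = (new_cost, path + [neighbor])
                heapq.heappush(queue, (new_cost, neighbor, path + [neighbor]))
    return dist

def find_tunnel_endpoint(graph, local, dest, failed_neighbor):
    """Find a router within two hops (skipping the failed neighbor)
    whose best path to dest avoids the local router entirely."""
    candidates = set()
    for neighbor in graph[local]:
        if neighbor == failed_neighbor:
            continue
        candidates.add(neighbor)
        candidates.update(graph[neighbor])
    candidates -= {local, failed_neighbor}
    for node in sorted(candidates):
        cost, path = spf(graph, node)[dest]
        if local not in path:
            return node
    return None

# A topology in the spirit of Figure 5.11: when the path through B
# fails, Router D still points back at A, but Router E (two hops
# away, through D) reaches the destination behind C cleanly.
graph = {
    "A": {"B": 1, "D": 1},
    "B": {"A": 1, "C": 1},
    "C": {"B": 1, "E": 1},
    "D": {"A": 1, "E": 3},
    "E": {"D": 3, "C": 1},
}

# D is examined but rejected (its best path runs through A);
# E is the nearest point where tunneled traffic cannot loop back.
assert find_tunnel_endpoint(graph, "A", "C", "B") == "E"
```

This also shows why most mechanisms stop at two hops: each additional ring of candidates multiplies the number of SPF runs required.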
Once the remote point where traffic can be safely tunneled is found, some form of tunnel must be automatically created to use as a backup path. This tunnel must be inserted in the local forwarding table with a very high metric, or inserted in such a way as to prevent it from being used until the set of primary (nontunneled) paths has been removed from the table. The endpoint of the tunnel must also have a tunnel tail end through which to handle packets being forwarded down this path, preferably something that is autoconfigured to reduce network management and configuration overhead.
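The insertion behavior just described can be sketched as follows; the table structure and the metric value are purely illustrative:

```python
# The tunnel backup is installed alongside the primary path, but
# with a metric high enough that it is only selected once every
# primary (nontunneled) entry has been withdrawn.

TUNNEL_BACKUP_METRIC = 2**24  # deliberately worse than any real path

forwarding_table = {
    "2001:db8:0:1::/64": [
        {"next_hop": "Router B", "metric": 2, "tunnel": False},
        {"next_hop": "tunnel via remote endpoint",
         "metric": TUNNEL_BACKUP_METRIC, "tunnel": True},
    ]
}

def lookup(table, prefix):
    """Return the lowest-metric entry for the prefix."""
    return min(table[prefix], key=lambda e: e["metric"])

# Steady state: the primary path through Router B wins.
assert lookup(forwarding_table, "2001:db8:0:1::/64")["next_hop"] == "Router B"

# When the primary is withdrawn, the tunnel entry takes over
# immediately, with no new SPF run required first.
forwarding_table["2001:db8:0:1::/64"] = [
    e for e in forwarding_table["2001:db8:0:1::/64"] if e["tunnel"]
]
assert lookup(forwarding_table, "2001:db8:0:1::/64")["tunnel"] is True
```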
Returning to the state, speed, surface, optimization model developed up to this point will help expose what new complexities these types of tunneled fast reroute add to the network.
• Tunneled solutions add state in a number of ways. First, there is the additional internal and implementation state, both of which add complexity. This state includes the additional processing required to recalculate any tunneled backup paths when the network topology changes. Second, some mechanisms require the explicit advertisement of additional control plane state (such as NotVia), which adds control plane protocol complexity. Third, if the tunneling protocols used to build the fast reroute backup paths are not currently deployed in the network, they must be deployed—adding additional protocols to the stack, configurations, and so on. Finally, provisioning (or autoprovisioning) tunnel endpoints throughout the network increases complexity by increasing configuration.
• Tunneled fast reroute mechanisms don’t increase the speed at which the control plane operates, nor the speed at which changes occur in the network topology. In fact, fast reroute can actually reduce the speed of updates by reducing the need for the finely tuned timers described above. In this case, tunneled fast reroute mechanisms can be a way to reduce complexity.
• Tunneled solutions increase the size and depth of interaction surfaces in a number of ways; in fact, this is probably the primary way in which these solutions increase network complexity. A few examples will suffice. First, determining the traffic patterns in a network is much more complex, particularly during network outages. For networks that value a high level of utilization (rather than overbuilding and ignoring quality of service), this adds a number of factors to consider when planning, configuring, and managing quality of service. Second, because traffic can travel over tunnels during transient states to a point multiple hops away in the network, following a particular stream or flow can become much more difficult. In networks with constant topology change, this can become an overwhelming challenge. Third, operators must remember to look in more than one place when determining where traffic is flowing and why—there is more to the control plane than just what is in the current forwarding table. Finally, tunneled mechanisms require that a large number of devices be capable of dynamically terminating tunnels to reroute traffic. Each of these dynamically opened tunnel tail ends is a potential security threat and a source of additional management complexity.
It should be obvious by now that moving from default timers, to precalculated LFAs, to tunneled LFAs, involves two somewhat small steps up in complexity, and one final step that’s a bit larger in complexity. Figure 5.12 illustrates this concept.
Note
Tunneled and precomputed LFAs are close to identical in the speed of convergence offered by each solution, so they are shown close together in Figure 5.12. The primary tradeoff between these two is the amount of control plane complexity versus the types of topologies each will cover. Tunneled LFAs can cover just about every topology (with some exceptions—nonplanar topologies aren’t always covered by every tunneled LFA technology), at the cost of more complexity in the control plane.
You might place each specific solution in a different place on this curve—it all depends on the requirements of the network you’re working on, and how you perceive the added complexity of moving to ever faster convergence. Either way, movement along the continuum of talk, talk faster, precompute, and tunnel adds another layer of interconnected entities into the network as a system, another set of interaction surfaces, and another group of uncertainty points. While each one appears to be fairly simple on the surface, complexity can add up very quickly when trying to converge quickly. The tradeoffs here are real, persistent, and urgent.
Will there ever be any end to the fast convergence game? Let’s put the question the other way around. Networks are, in essence, a sop to human impatience. The problem network engineers face is that there is no apparent end point to human impatience—no matter how fast something is, it can always be faster. As noted in the beginning of this chapter, email and file transfer work fine with moderately fast control plane convergence. Voice and video, however, put higher expectations on the network. High-speed trading in the stock market and real-time medical uses place even harder requirements on network convergence. As machine-to-machine and other applications are deployed across networks, they are likely to push for ever-increasing convergence speeds, until engineers have simply reached the limit of what can be done to make a network control plane converge faster. In short, there is probably no end in sight for trying to make the network converge more quickly.
Virtualization is one of those “always new” technologies in the world of networking. It was new when Frame Relay first provisioned virtual circuits, it was new again when Asynchronous Transfer Mode (ATM) was first proposed, it was new when VLANs on Ethernet were proposed, it was new when MPLS (originally known as tag switching) was proposed, and it will be new again in some future instance. Why is virtualization such a popular and perennial topic in network engineering? Because it provides a way to hide information (more on information hiding can be found in Chapter 6, “Managing Design Complexity”), which allows network operators to resolve complex problems in apparently simple ways. To understand the reasoning behind virtualization, let’s look at a simple use case using the network illustrated in Figure 5.13.
Assume we want to be certain that Hosts A and F can talk to one another, and Hosts B and E can talk to one another, but Hosts A and B cannot, nor A and E, nor B and F, nor E and F. We could, of course, set up filters at all the interfaces through the network to ensure traffic can only pass between the hosts we’ve decided should communicate, and traffic cannot pass between hosts we don’t want to communicate. This would take a lot of management and configuration effort, however—it’s simpler to create two virtual topologies on which only devices within each virtual topology can communicate, and attach the right set of hosts to each virtual topology.
In this case, we could include E, its upstream router D, the [D,E] link, B’s upstream router C, the [B,C] link, and B in one topology. In the other topology, we could include A, its upstream router C, the [A,C] link, D (F’s upstream router), the [D,F] link, and F. So long as we have some way to separate the two topologies along the [C,D] link, so each topology appears to be its own “network,” we can build two apparently independent topologies through which to carry traffic.
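Modeled as data, the two virtual topologies are just membership sets: a pair of hosts may communicate only when both appear in the same set. This is a deliberately simplified sketch, with host names following Figure 5.13 and topology names invented:

```python
# Two virtual topologies as membership sets; communication is
# permitted only within a topology, never across them.

topologies = {
    "red": {"B", "E"},   # B and E may talk to one another
    "blue": {"A", "F"},  # A and F may talk to one another
}

def may_communicate(host1, host2):
    """True only if both hosts sit on the same virtual topology."""
    return any(host1 in members and host2 in members
               for members in topologies.values())

assert may_communicate("A", "F") is True
assert may_communicate("B", "E") is True
# Every cross-topology pair is blocked, with no per-interface filters.
assert may_communicate("A", "B") is False
assert may_communicate("E", "F") is False
```

The policy lives in one small table rather than in filters scattered across every interface in the network, which is exactly the simplification virtualization is buying here.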
While this is a bit oversimplified, this is the essence of most virtualization deployments—the ability to create virtual topologies allows us to transfer the complexity of configuring a lot of filters into the complexity of an additional set of control plane and data plane primitives.
While there are many reasons to virtualize, they all come down to some form of multiplexing or another, whether between applications, groups of users, individual users, or some other dividing line. There are also many forms of virtualization, but they all come down to the same set of concepts—putting an outer wrapper, tunnel header, or some other marker in the packet header to segregate one topology from another. Let’s consider some of the tradeoffs of virtualization from this rather simplistic viewpoint.
Virtualization allows the network operator to break a single problem into multiple parts, and then solve each one separately. An easy example—rather than managing the quality of service and interactions of hundreds of different flows from dozens of different applications, configurations can be built around the small set of applications traveling along each virtual topology, with inter-virtual-topology configuration managed separately. This layering of functions is essentially what we see in building protocols, in hierarchical network designs, and even in building and testing software as units rather than as monoliths. Functional separation is a major simplification tool; virtualization is a good, time-tested way to provide functional separation.
Separating layers and problems into multiple parts, however, can generate more complexity by increasing the number of surfaces which must interact. Each component created also creates a set of points of interaction between the new module and other existing modules. To return to the quality of service example, while there are fewer traffic flows and applications to manage—they’re grouped off by virtualizing them—a number of new questions need to be answered, such as:
• Should each virtual topology be considered a single “class of service,” or should there be multiple “buckets” or classes of service within each virtual topology?
• If each virtual topology is going to have multiple classes of service, how should these classes of service be indicated? Is copying the quality of service information from the “inner” header to the “outer” header enough? Does there need to be some form of mapping between these two pieces of quality of service information?
• How should the different classes of service interact between the virtual topologies? Should all the “gold” service be lumped into a single class on the underlying transport network, or should one topology have priority over the others (and why)? This is specifically problematic when each topology represents a different customer. If two customers have both purchased a “gold” level of service, how should the network react when a link or path is congested enough that this level of service cannot be maintained for both customers? Should one customer “lose” to the other? Should both receive degraded service? Should there be different levels within the “gold” class based on the size or total worth of the customer? These are difficult questions to answer.
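The second question in this list—copying versus mapping the quality of service marking—can be made concrete with a short sketch. The DSCP values and the mapping policy below are invented for illustration:

```python
# Two ways to carry quality of service across an encapsulation
# boundary: copy the inner DSCP unchanged, or map it through a
# per-virtual-topology policy on the shared transport.

def encapsulate_copy(inner_dscp):
    """Copy the inner DSCP unchanged to the outer header."""
    return inner_dscp

def encapsulate_map(inner_dscp, topology_policy):
    """Map the inner DSCP through a per-topology policy, defaulting
    to best effort (0) for code points the policy does not list."""
    return topology_policy.get(inner_dscp, 0)

# Example policy: this topology's "gold" traffic (DSCP 46, expedited
# forwarding) is demoted to DSCP 26 on the shared transport.
policy = {46: 26, 26: 10}

assert encapsulate_copy(46) == 46
assert encapsulate_map(46, policy) == 26
assert encapsulate_map(0, policy) == 0   # unlisted -> best effort
```

Copying is simpler, but it lets every topology claim the transport’s best class; mapping restores control to the transport operator at the cost of one more table to design, configure, and keep consistent.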
The forwarding plane is simpler because each packet forwarding device in the network must only examine a small set of header bits, a label, or a tag to determine which links a packet can or cannot be forwarded on. The alternative, of course, is to have a rather explicit set of filters, built on a per-host basis, which forwarding devices must examine to make the same determination. Given the complexities of maintaining and searching such a list, header bits or an outer header is a much simpler solution to the traffic separation problem.
On the other hand, the forwarding plane is more complex in some ways; traffic, on being accepted into the network, must have some set of bits, a label, or an outer header imposed on each packet. These same bits of information must be stripped as the packet passes out of the network.
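Taken together, the edge and interior behavior just described might be sketched like this; the packet and tag representation is invented for illustration:

```python
# Ingress imposes a topology tag, every interior hop decides from
# that one tag alone, and egress strips the tag again.

def ingress(packet, topology_tag):
    """Impose the virtualization tag as the packet enters the network."""
    return {"tag": topology_tag, "inner": packet}

def interior_forward(tagged_packet, allowed_tags_on_link):
    """An interior hop tests one small field, not a per-host filter list."""
    return tagged_packet["tag"] in allowed_tags_on_link

def egress(tagged_packet):
    """Strip the tag as the packet leaves the network."""
    return tagged_packet["inner"]

packet = {"src": "Host A", "dst": "Host F"}
tagged = ingress(packet, "blue")

# A link carrying the "blue" topology accepts the packet; a link
# carrying only "red" does not.
assert interior_forward(tagged, {"blue"}) is True
assert interior_forward(tagged, {"red"}) is False

# The original packet emerges unchanged at the network edge.
assert egress(tagged) == packet
```

The extra work, as the paragraph notes, lives entirely at the edges: the interior stays simple precisely because the imposition and stripping happen on entry and exit.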
Virtualization simplifies the control plane by segmenting the reachability, topology, and policy information normally carried in a single control plane among several different control planes. Each individual control plane in a virtualized stack is therefore simpler—but what of the overall network complexity? Adding more control planes stacked on top of one another clearly increases complexity in the following ways:
• Increasing the amount of state. Link and node status impacting multiple topologies must be carried in multiple control planes.
• Increasing the speed of change. While the actual rate at which changes take place might remain the same, the speed at which these changes are reported (the velocity of information flow in the set of control planes) must increase.
• Increasing the interaction surfaces throughout the network. Multiple control planes, each operating with their own timing and convergence characteristics, must now operate over the same set of topology information. Solutions such as wait for bgp are designed to manage these intercontrol plane interactions—but these solutions inevitably add state complexity, as well.
While Shared Risk Link Groups (SRLGs) are probably the most easily understood result of virtualization, they are often the most difficult to remember and account for in network design and implementation. Sometimes, in fact, they are impossible to discover until the common risk has reared its ugly head through a major outage. Essentially, an SRLG is formed whenever two virtual links share the same physical infrastructure. For instance:
• Two different circuits running over the same Frame Relay link/node.
• Two different VLANs running over the same physical Ethernet link, or across the same switch.
• Two different virtual processes running on the same physical compute and storage set.
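When the mapping from virtual links to underlying physical facilities is known, finding SRLGs is mechanical: any facility carrying more than one virtual link is a shared risk. The link and facility names below are invented, echoing the examples above:

```python
from collections import defaultdict

# Group virtual links by the physical facility they ride; any
# facility carrying two or more virtual links is an SRLG.

def find_srlgs(virtual_to_physical):
    """Return {facility: set of virtual links} for shared facilities."""
    by_facility = defaultdict(set)
    for vlink, facilities in virtual_to_physical.items():
        for facility in facilities:
            by_facility[facility].add(vlink)
    return {f: links for f, links in by_facility.items() if len(links) > 1}

virtual_to_physical = {
    "vlan-10": {"switch-1", "fiber-span-7"},
    "vlan-20": {"switch-1", "fiber-span-8"},
    "circuit-a": {"fiber-span-7"},
}

srlgs = find_srlgs(virtual_to_physical)

# Both VLANs ride switch-1, and vlan-10 shares fiber-span-7 with the
# leased circuit: two distinct shared risks a diagram may not show.
assert srlgs["switch-1"] == {"vlan-10", "vlan-20"}
assert srlgs["fiber-span-7"] == {"vlan-10", "circuit-a"}
assert "fiber-span-8" not in srlgs
```

The hard part in practice, as the next paragraph notes, is not the computation but obtaining the mapping itself, since providers rarely disclose which facilities their services share.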
Quite often, SRLGs are buried under layers of virtualization to the point that very few designers think about their existence or impact. In some cases, SRLGs can be buried under contracts and subcontracts—two major providers leasing redundant circuits to a single facility across a common fiber cable, for instance, or two hosting services providing redundant services out of a single data center. It’s up to the network designer to find these sorts of overlapping service offerings—very few providers are going to be helpful in determining which facilities are shared and which are not.
Design complexity is not always obvious, and therefore not always completely considered or dealt with. In fact, the law of unintended consequences, a side effect of the impossibility of solving or removing complexity in the face of hard problems, makes many of these design problems the most haunting, and difficult to solve. In the next chapter, we’re going to look at the tools network engineers have that can help balance complexity against the other factors.
The last chapter examined the tradeoffs engineers and architects need to consider when designing the topology and control planes of a network. How can the network engineer manage these tradeoffs—state versus stretch, topology versus the speed of convergence, and virtualization versus design complexity? When should one (for instance, virtualization) be chosen over the other (for instance, fate sharing)? These are, in fact, some of the most difficult decisions to make in the field of network engineering, perhaps even more so because they rarely come to the surface as intentional decisions at all.
This chapter will consider a few ways engineers can manage these types of decisions, from the traditional to the not-so-traditional. Three specific areas considered are as follows:
• Modularity, which enables information hiding.
• Information hiding as a potential solution to managing control plane state.
• Models and visualization tools as a method to work with virtualization and fate sharing issues.
Modularity is a longstanding, tried, and true method used in network design and engineering. Let’s consider some examples of modular design, and then consider how modular design attacks the complexity problem.
Uniformity is a common mechanism used by network designers and engineers to reduce complexity within a network. The following sections discuss several of these techniques, and their tradeoffs.
It’s common for a company or network operations team to choose devices from the same vendor for use throughout their network to reduce the number of interfaces and implementations, thus reducing the amount of training and variety of skill sets required to configure new nodes as well as troubleshoot problems as they arise.
There are three tradeoffs when choosing this route to controlling complexity:
• Vendor Lock-in: If a single vendor’s equipment is chosen throughout the network, then the vendor ends up driving the hardware and software life cycles. Barring a great deal of self-control, the vendor will also end up controlling the architecture of the network, whether or not that’s best for the company.
• Cost: Using a single vendor will almost always drive up cost, as the vendor faces no competition.
• Monoculture Failures: A single vendor’s equipment, running a single operating system, and a single set of applications, is a monoculture. A monoculture suffers from a shared set of failure modes across all the devices; if one device in the network reacts poorly to a specific set of circumstances, all the devices in the network will. Seemingly small problems can turn into major failures in a monoculture.
Using the same physical switches throughout a single data center fabric is one approach to uniformity. Based on the principles laid out by Charles Clos in 1952, spine and leaf designs were originally created to build a large fabric based on minimal equally sized switches, as shown in Figure 6.1.
Assuming that there are only four connections from each leaf node shown in Figure 6.1 to some sort of load, each switch in the Clos fabric illustrated has the same number of interfaces—8. Using a single device type, each with 8 interfaces, a total of 32 devices can be interconnected (internally, or not connected to an outside network). In fact, the entire point of the original design was to allow the interconnection of a large number of devices by combining the switching capabilities of small, equally sized switches. Using a single device type throughout an entire data center in this way allows the network operator to minimize the number of spares on hand, as well as minimize the configuration and management involved when swapping out a failed device or link hardware.
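The arithmetic behind these numbers can be checked directly. This sketch assumes each leaf splits its ports evenly between hosts (down) and spines (up), and each spine port connects to one leaf:

```python
# Capacity of a two-stage spine-and-leaf fabric built entirely from
# identical switches: half of each leaf's ports face hosts, half
# face spines, and every spine port feeds a distinct leaf.

def clos_capacity(ports_per_switch):
    """Hosts supported by a spine-and-leaf fabric of equal switches."""
    uplinks = ports_per_switch // 2        # leaf ports toward spines
    num_leaves = ports_per_switch          # one leaf per spine port
    hosts_per_leaf = ports_per_switch - uplinks
    return num_leaves * hosts_per_leaf

# Eight-port switches interconnect the 32 devices cited in the text.
assert clos_capacity(8) == 32

# Capacity grows with the square of the port count: 16-port switches
# already interconnect 128 devices with the same design.
assert clos_capacity(16) == 128
```

This quadratic growth from small, equal switches is exactly the property Clos designs were created to exploit.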
Note
The Clos fabric may appear similar to a traditional hierarchical Ethernet or IP network design, particularly when it is drawn with all the leaf nodes on one side (in the “folded” configuration). However, there are several crucial differences. For instance, in the spine and leaf topology (or fabric), there are no cross connections between any two leaf nodes, or any two spine nodes. This property means there are no topological loops in the fabric other than those passing through a leaf node. If the leaf nodes are configured so they cannot forward traffic received on an interface connected to a spine node back to another spine node, the spine and leaf design has no topological loops. This property simplifies the design of the network, allowing it to scale through the addition of more spine and leaf nodes, and reducing the work on the control plane.
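The structural property described in this note can be verified mechanically: every link in a spine and leaf fabric connects exactly one spine to one leaf, so any loop must pass through a leaf. A small sketch, with invented node names:

```python
# Check the spine-and-leaf property: no spine-to-spine and no
# leaf-to-leaf cross connections anywhere in the fabric.

def is_spine_and_leaf(edges, spines, leaves):
    """True if every link connects one spine to one leaf."""
    return all(
        (a in spines and b in leaves) or (a in leaves and b in spines)
        for a, b in edges
    )

spines = {"S1", "S2"}
leaves = {"L1", "L2", "L3", "L4"}
fabric = [(s, l) for s in spines for l in leaves]  # full spine-leaf mesh

assert is_spine_and_leaf(fabric, spines, leaves) is True

# Adding a single leaf-to-leaf cross connection breaks the property,
# and with it the loop-free guarantee described in the note.
assert is_spine_and_leaf(fabric + [("L1", "L2")], spines, leaves) is False
```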
Several realities make this attempt at reducing complexity more difficult than it might appear on the surface. For instance, device models change over time, which means serious thought needs to be put into controlling the life cycle of devices installed in the network. Theory aside, it’s very difficult to use a single device in every location, even in a data center fabric; spine switches often need higher port count than top of rack devices, fat tree designs force the designer into choosing between multiple, less expensive, fixed configuration devices or more consistent usage of more expensive, blade- or slot-based devices, etc.
It is possible to purchase equipment from multiple vendors, and then rely on open standards implementations of control planes and management interfaces. For instance, a network operator may decide to standardize on IS-IS as a control plane throughout their network, and to standardize on open-standards-based YANG models, using a NETCONF or RESTCONF transport to manage all the devices.
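As a minimal illustration of what such a standardized management interface looks like on the wire, the sketch below builds the NETCONF <get-config> request defined in RFC 6241 using only the Python standard library. In practice a NETCONF client library would wrap this payload in an SSH session and RPC framing; the message-id value here is arbitrary:

```python
import xml.etree.ElementTree as ET

# Base NETCONF namespace from RFC 6241.
NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"

def build_get_config(datastore: str = "running") -> str:
    """Build a NETCONF <get-config> RPC for the given datastore."""
    rpc = ET.Element(f"{{{NETCONF_NS}}}rpc", {"message-id": "101"})
    get_config = ET.SubElement(rpc, f"{{{NETCONF_NS}}}get-config")
    source = ET.SubElement(get_config, f"{{{NETCONF_NS}}}source")
    ET.SubElement(source, f"{{{NETCONF_NS}}}{datastore}")
    return ET.tostring(rpc, encoding="unicode")

print(build_get_config())
```

The same request works against any vendor's device that implements the standard, which is exactly the interchangeability argument made above.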
There are, of course, several tradeoffs to consider when attempting to deploy this type of solution. For instance:
• Vendors are financially driven to support features, rather than management interfaces. This means management interfaces often lag far behind the implementation of new protocols or features on networking products.
• Vendors are, at least to some degree, driven away from supporting standardized interfaces. By supporting standardized interfaces, vendors are opening themselves up to being easily replaced in any deployment—in effect, standardized interfaces drive equipment into more of a commodity role.
• Standard bodies are often slow to produce new features or extensions to protocols. If a network operator waits until a new idea is fully standardized, or even looks like it will be standardized in a specific form, they could risk losing some advantage over their competitors.
It is often possible to cover up interface differences through a variety of tools, of course—network management systems and open source tools like Puppet can often work to find a common set of features through a thunk or hardware abstraction layer. However, keep the problem of leaky abstractions in mind when working with these sorts of tools.
One of the overlooked areas of network complexity is the wide array of transport systems deployed in the real world. Remember the old IP-focused model, with one transport over a variety of circuits, and a lot of different applications running on top? The reality today is a spaghetti of overlays, tunnel types, and transports scattered throughout the network, as Figure 6.2 (partially) illustrates.
The spaghetti transport system has several interesting (and difficult to manage) features, including:
• Several upper layer protocols are heavily reliant on information contained in and state generated by lower layer protocols. For instance, many upper layer applications actually rely on the IP address as a sort of identifier for a particular system connected to the network, although the IP address is actually a locator.
• Several upper layer protocols can carry lower layer protocols, creating a complex and overlapping stack. Although Virtual Extensible Local Area Network (VXLAN) tunnels rely on IP, they actually carry Ethernet frames; IPv4, IPv6, and Generic Routing Encapsulation (GRE) tunnels (including MPLS carried inside a GRE tunnel) are regularly carried on top of VXLAN.
• IPv4 and IPv6 can run as parallel network protocols, either with their own control plane or separate control planes.
Each of these protocols is a leaky abstraction; laying them one on top of another in this way, especially when protocols are designed and deployed in "stacking loops," such as Ethernet over VXLAN over IP over Ethernet, causes the leaks in each abstraction to mingle, making a complete mess in terms of management and troubleshooting. Each pair of protocols in this illustration represents another interaction surface in the overall system, as well.
What can be done to alleviate this spaghetti transport system? Reduce the number of transports running in any given network to the minimum possible. Designers should try to choose the minimal number of protocols, both overlays and underlays, that will support all the applications and all the requirements. This will necessarily involve tradeoffs in support for specific applications or systems, but the gain in reduced complexity is well worth it, as it will reduce the MTTR, control plane state, and many other factors. Using MPLS as an example—a single MPLS deployment can support Ethernet, IPv4, and IPv6 virtual links with full virtualization across a very minimal transport system. Deploying dual-stack IPv4/IPv6 and VXLAN for Layer 2 virtualization (with IP on top for Layer 3 virtualization) is a much more complex solution to the same set of problems.
Remember that just because another protocol is available, it doesn't mean you have to deploy it in your network—just because you can doesn't mean you should.
The one downside to standardizing on just a few transports can be illustrated by returning to MPLS. It's much harder (or rather more expensive) to obtain MPLS support in enterprise or data center class equipment, as MPLS is seen as a "transit service provider solution." Sometimes our own preconceptions can cause us to choose solutions based on what they're "meant for," rather than what they're really useful for; in the end, this can make our networks more, rather than less, complex.
At the device level, using a single vendor's products, or in a more complete way, the same model of device (see the previous example on data center fabrics), can reduce complexity through uniformity. The same principle can be applied at the network level, as well, by cataloging each of the different types of modules in the network, and then building each one to be as similar as possible. For instance, you might be able to divide a network into a small set of topologies:
• Campus
• Data Center
• Point of Presence
• Core
For each one of these "places," determine a set of roles and the requirements that go with each role. For each role in each place, deploy a single network topology and configuration that will support the range of requirements. So long as discipline is maintained in:
• Keeping the configurations and deployments of every instance of a single "place" the same across time and location.
• Keeping the number of "places" and roles to the minimum possible.
The amount of work required to design, deploy, manage, and troubleshoot these modules is greatly diminished.
Within the data center, this type of thinking is often applied to "pods," or "modules," as well. Each set of racks within a data center can be considered a modular unit, where the data center grows in units (rather than devices), and units, rather than individual devices, are upgraded or replaced. There are, as usual, a number of problems that come along with interchangeable modules.
First, interchangeable parts often aren't. In spite of all the work network managers do to make each module identical, local conditions, spread across time, work against consistency. As manufacturers replace lines of equipment, or replace one device with another, it quickly becomes difficult to ensure consistency across a large number of identical modules. This is also a problem when interconnecting different Places In the Network (PINs). As an example, if you decide to upgrade the connection speed between PINs, you will need to modify every PIN in the entire network. Modularity is often difficult to maintain in the face of such problems.
Second, a problem often hidden behind interchangeable modules is the "places in the network" syndrome, where the modules become the focus of all network design and operations, leaving the network as a whole, or as a system, off the table as something to be considered. It's a bit like building a house by choosing a lot of different rooms you think you want, and then choosing a hallway to connect them all. It might be really nice from the perspective of the person living in the house, but the builder is going to face some real trouble in trying to build what the customer wants—and maintenance is going to be a disaster.
Note
See the section on “Places in the Network” later in this chapter for a more complete description of PINs.
Network engineers often equate modularity with hierarchical network design and aggregation, but these two are not precisely the same thing. Hierarchical network design is a discipline that uses modularity as one of its principles, but not all hierarchical network designs are modular, nor are all modular network designs hierarchical. In the same way, aggregation often depends on modularity, but you can build a very modular network that doesn't aggregate any information at all. So, if aggregation and hierarchy are not the point, how does modularity, and the examples given here—uniform hardware, uniform management, uniform transport, and interchangeable modules—really impact the complexity of a network's design?
While a truly modular design often leads to the ability to reduce state, modular designs themselves don't actually reduce state. The same can be said for speed, as well—building a modular design might provide opportunities to reduce the speed at which information is spread through the control plane, but it doesn't directly impact the speed at which the network operates.
Surface, then, is the primary means through which modularity reduces complexity in network design. By breaking the network up into multiple smaller pieces, modularity reduces the size of interaction surfaces in the network, as illustrated in Figure 6.3.
Modularization, on its own, provides some reduction in network complexity, both from a management and a control plane perspective, by reducing the size of the interaction surfaces. But, as above, modularization doesn't really attack the state or speed aspects of network complexity. Modularity does, however, enable information hiding—and information hiding directly attacks state and speed. Let's look at two forms of information hiding commonly used in network design: aggregation and virtualization.
Aggregation is, by far, the most common form of information hiding in networks. There are actually two types of aggregation in use in network design, although they are often conflated into the same thing: hiding topology information and summarizing reachability information. The easiest way to understand the difference between the two is by examining aggregation in a link state protocol, such as IS-IS. Figure 6.4 will be used to illustrate the difference.
Given the network as illustrated, with no configuration beyond placing Routers A and B in a single level 2 flooding domain, and Routers B, C, D, and E in a separate level 1 flooding domain, what would Router B see in its level 1 IS-IS topology database versus what Router A sees in its level 2 topology database?
Router B:
• 2001:db8:0:1::/64 connected to D
• 2001:db8:0:2::/64 connected to E
• Router B => Router D
• Router D => Router B
• Router E => Router D
• Router D => Router E
• Router E => Router C
• Router C => Router B
• Router B => Router C
• Router B connected to a level 2 routing domain (the connected bit set in Router B's LSP)
Router A:
• 2001:db8:0:1::/64 reachable through Router B
• 2001:db8:0:2::/64 reachable through Router B
• Router A => Router B
• Router B => Router A
Examining these two lists, the big differences are as follows:
• The connection points for the two reachable subnets are attached to Router B's routing information, rather than Routers D and E.
• The links between Routers B, C, D, and E are all removed from the topology database.
By attaching all the reachable destinations within the level 1 flooding domain to the level 1/level 2 flooding domain border, Router A doesn't need to know the internal details of the level 1 topology. The point of creating multiple flooding domains in a link state protocol like IS-IS is, in fact, to block topology information in the "outlying" flooding domains from reaching the "core" flooding domain—from the level 1 flooding domains into the level 2 flooding domain. Router B, then, is aggregating the topology information, carrying just the reachable destinations and the cost to reach them, rather than passing the link state between each pair of routers within the level 1 flooding domain to Router A. What does aggregating this topology information accomplish? Two things:
• Aggregation of topology information decreases the state carried in the control plane by removing the link information carried in the level 1 flooding domain from the level 2 flooding domain, hence reducing the size of the topology database and the size of the tree the SPF algorithm must calculate across.
• Aggregation of topology information decreases the speed of the information flowing through the control plane by removing information about the state of the links within the level 1 flooding domain from the level 2 database; it doesn't matter to Router A what the state of the link between Routers D and E is, for instance, so long as the reachable destinations remain constant from the perspective of Router B. By blocking topology information from flowing into the level 2 link state topology database, aggregation slows down the pace of updates, and hence the pace at which Router A must react to those updates.
Aggregation can be taken one step further by configuring Router B to aggregate the reachability information in the level 1 flooding domain to which it's attached. In this case, the two routes, 2001:db8:0:1::/64 and 2001:db8:0:2::/64, can be aggregated into a single shorter prefix route, 2001:db8::/61. If you examined Router A's level 2 topology database after this aggregation is configured, you'd find it is now slightly smaller:
• 2001:db8::/61 reachable through Router B
• Router A connected to Router B
• Router B connected to Router A
Once again, examining this from the perspective of state and speed:
• Aggregation of reachability information decreases the state carried in the control plane by reducing the two reachable destinations to a single reachable destination.
• Aggregation of reachability information decreases the speed of the information flowing through the control plane by removing information about the state of the two individual reachable destinations within the level 1 flooding domain. No matter what the state of either 2001:db8:0:1::/64 or 2001:db8:0:2::/64 is, the state of 2001:db8::/61 remains constant.
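The prefix arithmetic behind this aggregation can be checked with Python's standard ipaddress module; this short sketch simply verifies that both component /64 routes fall inside the advertised /61 aggregate:

```python
import ipaddress

# The two component routes advertised inside the level 1 flooding domain.
components = [
    ipaddress.ip_network("2001:db8:0:1::/64"),
    ipaddress.ip_network("2001:db8:0:2::/64"),
]
# The shorter prefix Router B advertises into the level 2 flooding domain.
aggregate = ipaddress.ip_network("2001:db8::/61")

# Both components are covered by the aggregate, so Router B can
# advertise the single /61 in place of the two /64s.
for component in components:
    assert component.subnet_of(aggregate)
    print(component, "is covered by", aggregate)
```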
Aggregation, like all abstractions, has a set of tradeoffs (as always, there is no such thing as a free lunch). The last chapter spent a good deal of time discussing the impact of aggregation on stretch—that discussion won't be repeated here, but it's important to bear in mind. Beyond stretch, however, network engineers should remember that aggregation is also subject to the law of leaky abstractions, explained in a sidebar above. How do aggregates leak? Figure 6.5 illustrates aggregation as a leaky abstraction.
Router B is configured to aggregate 2001:db8:0:1::/64 and 2001:db8:0:2::/64 to 2001:db8::/61 using the lowest metric from the two component routes. In this case, the metric would be chosen from the [B,D] path, so the aggregate would be advertised with a metric of 2. If the link between Routers B and D fails, however, the lowest metric among the components would be along the path [B,C,E], with a total metric of 3. When this link fails, the aggregate route will change metric from a cost of 2 to a cost of 3—hence the abstraction has leaked information about the topology through the aggregation. It is possible, of course, to plug this leak; however, it's important to remember that any solution used here will have its own set of tradeoffs to consider.
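The metric leak can be sketched in a few lines. The path names and costs below come straight from the example; computing the aggregate's metric as the minimum across surviving component paths mirrors Router B's configuration:

```python
# Metrics of the paths from Router B toward the component routes,
# from the example: the [B,D] path costs 2, the [B,C,E] path costs 3.
path_metrics = {("B", "D"): 2, ("B", "C", "E"): 3}

def aggregate_metric(paths: dict) -> int:
    """Router B advertises the aggregate with the lowest metric
    among the surviving component paths."""
    return min(paths.values())

print(aggregate_metric(path_metrics))   # before the B-D link fails: 2

# The B-D link fails, so the [B,D] path is withdrawn...
surviving = {p: m for p, m in path_metrics.items() if p != ("B", "D")}
print(aggregate_metric(surviving))      # after the failure: 3
# ...and the changed aggregate metric leaks topology information
# into the level 2 flooding domain.
```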
No discussion of information hiding would be complete without some examination of failure domains. Both aggregation and virtualization are used to limit the size of a failure domain—but what is a failure domain, and how does it relate to complexity? To begin, let's define what a failure domain is, then let's look at the relationship between failure domains and network complexity. Figure 6.6 illustrates failure domains.
Router B in this IS-IS network is configured to aggregate 2001:db8:0:1::/64 and 2001:db8:0:2::/64 into a single route, 2001:db8::/61, which is then advertised into the level 2 flooding domain (toward Router C). 2001:db8:0:8::/64 does not fall within this aggregate, so it is advertised "as is" into the level 2 flooding domain. Using this configuration, you can build a high-level overview of what the database for a selection of routers in the network will look like in relation to these three routes.
• Router B, Level 1 Database:
• Router B is connected to Router A.
• Router B is connected to a level 2 flooding domain.
• Router A is connected to Router B.
• Router A is connected to 2001:db8:0:1::/64.
• Router A is connected to 2001:db8:0:2::/64.
• Router A is connected to 2001:db8:0:8::/64.
• Routers B, C, & D, Level 2 Database:
• Router B is connected to Router C.
• Router C is connected to Router B.
• Router C is connected to Router D.
• Router D is connected to Router C.
• Router B is connected to 2001:db8::/61.
• Router B is connected to 2001:db8:0:8::/64.
• Router E:
• Router D is connected to Router E.
• Router E is connected to Router D.
• Router D is connected to a level 2 flooding domain.
Examining this high-level view of the link state databases at various places in the network would show the following:
• If 2001:db8:0:1::/64 or 2001:db8:0:2::/64 are disconnected from Router A, only Routers A and B will need to recalculate their SPT.
• If 2001:db8:0:8::/64 is disconnected from Router A, Routers B, C, and D will need to recalculate their SPT.
Hiding information, then, reduces the number of routers that must recalculate their SPT in reaction to a change in the network topology. This is, in fact, as good a definition of a failure domain as you are likely to find:
A failure domain is the set of devices that must interact with the control plane when network topology or reachability changes.
Given this definition, if 2001:db8:0:2::/64 is disconnected from Router A, the failure domain contains only Routers A and B. Routers C through E are not included in the failure domain in this case, because the information in their control planes doesn't change. So by hiding information through aggregation, the size of the failure domain has been reduced. This reduces the scope of the interaction surfaces within the control plane, as well as reducing the speed at which the control plane receives new information.
Another point to note from this example is that the failure domain is not a "solid line" you can paint around any particular part of the network. Some information (such as the example of the aggregate's metric changing given above) will always leak through any point where information is being hidden in the network (see the previous sidebar on The Law of Leaky Abstractions), so there are actually many different overlapping failure domains in any given network.
Information hiding in one form or another is one of the best tools the network designer has to deal with complexity. By hiding information between modules:
• The amount of state carried in the control plane is reduced. Aggregation removes reachability and topology information, reducing the total amount of information carried in the control plane. Virtualization (discussed in the last chapter) breaks the topology and reachability information up among multiple control planes, so that each control plane only manages a subset of the total network state.
• The speed of the control plane is reduced. Aggregation, for instance, reduces the speed at which the information the control plane is carrying changes, by either removing state entirely (topology) or replacing more volatile state with less volatile state (summarization of reachability information). Virtualization reduces the speed of changes in the control plane by spreading the changes across multiple control planes (it's important to remember that virtualization impacts the speed of control plane operations less than aggregation does).
• The interaction surfaces are contained, in aggregation, primarily through modularization and the creation of "choke points" in the network that limit the places where various pieces of the control plane interact. Virtualization reduces the size of the interaction surfaces by tying specific applications or customers to specific logical topologies, thus allowing each set of applications to be treated as a case that's independent of the rest of the applications running on the network. Of course, virtualization isn't quite as straightforward as aggregation, because the virtual topologies themselves must interact, creating another interaction surface in the network.
Models aren't an "on network" tool to deal with complexity, but rather a way of categorizing and abstracting what is happening in the network. Most network engineers are generally familiar with the seven- and four-layer models—and these are good models for abstracting the operation of protocols. What about useful models for understanding the deployment and operation of the network as a whole, or the operation of the network and applications? Three different models are presented in this section, and finally a modeling language.
The waterfall model isn't a model of network operation, but rather a model of traffic flows. This model is based on the basic insight that all routing and switching protocols essentially build a tree sourced at the destination, and spread out to each available source. Figure 6.7 shows the waterfall model.
The waterfall model shows how data flow splits in a network at every network device so that a single stream becomes a set of streams. Once any network data stream has split, there is no way to rejoin the streams back into a single stream of data without risking loops in the network topology. To understand how this relates to modeling a network, consider the difference between spanning tree and routing operation in a waterfall model.
Spanning tree builds one large tree which is shared by every source and destination pair. This single tree means that all traffic must "leap up the waterfall" to the head end, where it can then follow the flow of water back toward its real destination. For instance, a packet flowing along Stream W toward the end of Stream Z must follow the stream to its source at A, and then follow the path for Stream Z.
For a routed control plane, each edge device builds its own tree, so each device is at the head end of a waterfall, or a spanning tree. Rather than "leaping up" the waterfall to reach the top, and then flowing back down, each traffic stream follows its own set of streams down. In the example given here, if traffic were flowing from an entrance point at Stream W to an endpoint at Stream Z, it would follow a different tree originating at Router C. This is why a routing protocol is more efficient than spanning tree; traffic doesn't need to "leap up" the tree to reach its destination.
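The contrast can be sketched with a toy topology. The graph below is hypothetical (it is not the figure's topology); the point is simply that a routed control plane computes one shortest-path tree per root, while spanning tree elects a single shared root for all traffic:

```python
from collections import deque

# A small hypothetical topology: nodes and bidirectional links.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def shortest_path_tree(root: str) -> dict:
    """Breadth-first search from the root, returning each node's
    parent in the resulting tree. A routed control plane effectively
    builds one such tree per destination; spanning tree builds only
    one tree for the entire network."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent

# Trees rooted at different destinations differ, so traffic never has
# to "leap up" to a single shared root before flowing back down.
print(shortest_path_tree("A"))
print(shortest_path_tree("E"))
```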
PINs are a way to divide a network along functional, rather than topological, lines. Figure 6.8 illustrates a PINs view of a network.
Each functional section of the network is separated into a different component, such as the two data centers and the two campuses, and they are connected using interconnection points (marked IC in the illustration). Splitting the network up in this way emphasizes the function of each piece of the network. Different design paradigms can be used in each section, to match the specific purpose of the PIN being designed. For instance, a large-scale hub and spoke topology might dominate the retail environment, while the first data center might be designed using a traditional switched Ethernet topology, and the second as a Clos topology.
Security, management mechanisms, and other aspects of each PIN can also be different; Data Center 1 might have an open security policy within the data center itself, and strong restrictions on outside access, while Data Center 2 might have no entrance policies, but strong per server/application security mechanisms. Every PIN is completely opaque to every other PIN in the network. Data Center 1 is simply a traffic sink for Data Center 2, and the WAN PIN is simply a transport mechanism to reach the other PINs in the network. This is a form of strictly typed modularity.
Connecting these different PINs is a series of interconnects, shown as light gray circles in Figure 6.8. Some PINs might connect directly to each other as well as to the Wide Area Network (WAN), or Core PIN, as illustrated with the DMZ and Data Center 2 PINs. Others might connect only to the Core. Each PIN's internal structure is completely different from every other PIN's. For instance, Data Center 1 might have core, distribution, and access layers, and Data Center 2 might have only core and aggregation layers. These layers are completely independent of the overall network design.
PINs are useful for understanding a network design from an operational perspective, because they provide a strong functional view based on business use for each PIN. This allows each business problem to be approached independently, which can often clarify the problems involved. Vendor sales folks tend to work within PINs almost exclusively, because it helps to narrow the solution to a particular environment, helping to drive requirements.
As a model, PINs fail in one particular regard: they focus the network architecture on a bottom-up view of functionality. While this does allow the network to more closely mimic the functionality required, it pushes the overall architecture out of sight. This can result in a systemic architecture that "just grows organically," rather than producing a well-thought-out overall plan and architecture.
Note
Several other positive and negative aspects of PINs are similar to the positive and negative aspects of interchangeable modules discussed earlier in this chapter.
Hierarchical network models, grounded in scale-free networks, are as old as networks themselves. Hierarchical design is, in essence, taking the rules of modularity, combining them with a waterfall model of traffic flow, and finally combining these two with aggregation for information hiding, and building a set of "rules of thumb" that generally work for just about any network design project. Figure 6.9 illustrates a basic hierarchical design.
Note
Hierarchical network design is covered in detail in the Cisco Press book, Optimal Routing Design.2 Many networks are now two-layer designs, rather than three, with "layers within layers" to build out to scale.
2. Russ White, Alvaro Retana, and Don Slice, Optimal Routing Design (Cisco Press, 2005).
Consider some of the “rules of thumb” for hierarchical network design in network complexity terms.
Each of the three layers in the hierarchical design—access, distribution, and core—should be focused on a small set of functions or purposes. For instance, the network core should be focused on forwarding traffic between the distribution layer modules, rather than on implementing policy, or even providing connectivity to outside networks. The Internet and outside partner networks, for example, should not be connected to the network core, but rather to an access layer module parallel to the other access layer modules.
Focusing functionality within each layer helps manage complexity by controlling the interaction surfaces within and between the network modules. By focusing on forwarding between distribution layer modules, for instance, the configuration of the network core devices can be greatly simplified—admittance policy and per user or device security can be left off these devices, as those functions are handled someplace else in the network. Restricting the location of any particular function also simplifies equipment choices, and allows the modules within the layer to be more similar. As an example, if high-speed forwarding over a long distance dark fiber interface is a function consistently pushed to the network core devices, then distribution layer devices can be chosen without reference to the types of interfaces required to support this capability.
The hierarchical design pattern also provides convenient “choke points” in the network topology. At the edge of the topology, along the user connection point to the access layer, and between each layer, there are a smaller number of connections that pull the topology together. These places, where the amount of connectivity is more limited by the design of the network, are perfect places to implement policy in a more centralized way.
By way of illustration, consider aggregation as a policy. It makes sense to configure aggregation on the links between layers in a hierarchical design, as this:
• Provides full routing information within each module.
• Provides a small set of places to look for aggregation within each module.
• Provides a small set of places to look for suboptimal traffic distribution between the modules.
• Provides a separation point between the control plane state between modules.
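What aggregation at a layer boundary actually does can be sketched in a few lines. The following Python fragment (purely illustrative; the subnet values are hypothetical) collapses the routes inside a distribution layer module into the single summary that would be advertised across the choke point toward the core:

```python
import ipaddress

# Hypothetical subnets living inside one distribution layer module.
module_subnets = [
    ipaddress.ip_network("10.1.0.0/24"),
    ipaddress.ip_network("10.1.1.0/24"),
    ipaddress.ip_network("10.1.2.0/24"),
    ipaddress.ip_network("10.1.3.0/24"),
]

# Collapse the contiguous /24s into the single aggregate the module
# would advertise toward the core, hiding its internal topology.
aggregates = list(ipaddress.collapse_addresses(module_subnets))
print(aggregates)  # [IPv4Network('10.1.0.0/22')]
```

Inside the module, the full set of /24s is still visible; outside, the control plane carries only the /22, which is the separation of control plane state the last bullet describes.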
Providing "choke points" in the topology allows for controlled interaction surfaces between the network modules, points at which to reduce the control plane state being carried between modules, and, in turn, a way to control the speed at which the control plane must manage network changes.
Note
The tradeoffs around aggregation are discussed in the Information Hiding section earlier in this chapter and in Chapter 5, "Design Complexity." The tradeoffs around policy placement are also discussed in Chapter 5.
The Unified Modeling Language (UML) might seem out of place here, because it’s a modeling tool, rather than a model or a system. However, the network engineering world tends to “fly by the seat of its pants,” rather than intentionally think through the way the network interacts with applications and policies. A process focused modeling language can be really helpful in understanding how an application works in some detail, which can then be directly mapped to policies and packet flows in the network. This is particularly true in large-scale data center and cloud deployments. Although large-scale cloud and data center deployments are designed to be “application agnostic,” there often comes a point in troubleshooting an application running on a data center fabric when understanding the packet flow, and where policies are put in place, is really helpful in determining where the problem lies.
Figure 6.10 illustrates a UML model for a web application.
While the diagram shown in Figure 6.10 focuses on the interaction between the different processes that make up the web application, the interactions between the applications all happen across the network; as such, each interaction represents a flow of packets that must be planned for. Examining this diagram, a network engineer might ask questions such as:
• What type of connectivity is used between each of these processes? If it’s Layer 2 connectivity, then the processes must be placed into a single Layer 2 domain, or the traffic must otherwise be handled so the processes believe they are connected over a Layer 2 link.
• What protocol is used at each connection? For instance, TCP might be used between the Hypertext Markup Language (HTML) Render component and session control, but User Datagram Protocol (UDP) might be used between session control and the logic components. This information will play a large role in determining where and what types of quality of service need to be deployed.
• Where and how is policy implemented? There are only two policies noted on the diagram (this is rather simple compared to most real-world application deployments)—are these policies implemented by the network, the application, or both? How are they to be implemented?
Beyond these questions, getting a solid estimate of the sizes of each traffic flow could be very helpful in building a picture of the way the application uses network resources. If you have this type of information for each application, it is possible to roll this information up into an overall picture of what the traffic should look like on the network—but even at the individual application level, understanding these sorts of traffic flows can be useful.
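The roll-up described above is simple bookkeeping. A minimal sketch (in Python; the application names, interaction pairs, and Mbps figures are all hypothetical) of summing per-application flow estimates into a network-wide traffic matrix:

```python
from collections import defaultdict

# Hypothetical per-application flow estimates, in Mbps, keyed by the
# (source component, destination component) pairs taken from UML
# interaction diagrams like Figure 6.10.
app_flows = {
    "web-frontend": {("render", "session-control"): 40,
                     ("session-control", "logic"): 25},
    "reporting":    {("logic", "database"): 60,
                     ("database", "logic"): 15},
}

# Roll the per-application numbers up into one overall traffic matrix.
traffic_matrix = defaultdict(int)
for flows in app_flows.values():
    for pair, mbps in flows.items():
        traffic_matrix[pair] += mbps

print(dict(traffic_matrix))
```

Even this toy version shows the point of the exercise: once each application's flows are recorded in a common form, the overall picture of traffic on the network falls out of a simple summation.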
Network designers and architects face very uncertain environments from a complexity perspective. While the tools are “out there,” they aren’t often widely understood in terms of managing complexity, nor are they always well developed. Instead, most network designers work from “rules of thumb,” based on “seat of the pants flying,” based on long years of experience with what doesn’t work.
By examining the “rules of thumb” from a complexity perspective, you can begin to put some reasoning behind them—why they work, where they work, and when they won’t work. Examining the problems of network design from a complexity perspective helps to untangle some of the mysteries, and to point the way toward better network design through a deeper understanding of what the tradeoffs really are, and how the techniques and models really work.
Who cares about protocol complexity? It’s a topic reserved for the geeks with their heads in radix trees and deep math, right?
Wrong.
The protocols deployed on a network, whether used in the control plane or to carry data through the network, are actually systems in their own right—often complex systems—that interact with the other systems in the network along the same sorts of interaction surfaces discussed at the network and design levels. Because of these design surfaces, network engineers working “down in the protocols” need to know network design just as much as network architecture folks need to know protocol design.
This chapter aims to bring the complexity tradeoffs discussion to the world of protocol design, so network engineers can understand why one protocol might work better in one situation, and others in another. There are two opposite trends in the network engineering world to guard against:
• Trying to solve every problem with a single protocol; at press time, this is most pronounced in the drive to use BGP for everything. While a single control plane for everything (and one ring to rule them all) can greatly simplify the job of the network designer, it can also push suboptimal designs.
• Solving every problem with its own protocol suite.
The tradeoffs here aren’t always obvious, but there clearly needs to be a balance between deploying a point solution for every problem and deploying a single solution for all problems. This is where layering in a network design (virtualizing topologies so different control and data planes see different views of the network) can be very useful.
This chapter begins with a section considering the tradeoff between flexibility and complexity, using OSPF versus IS-IS as the “classic” example of how different complexity tradeoffs lead to completely different protocols, each with their own set of tradeoffs. The second section considers layering versus complexity, specifically in light of John Day’s iterative model of network transport. The third section considers protocol complexity against design complexity using two examples; the first is a return to the concepts of microloops and fast convergence in link state protocols, and the second considers the perception of EIGRP in the “early years” against the reality, and changes made to the protocol over the years to contain complexity in network design.
Perhaps the classic example of complexity versus flexibility in the world of protocol design is the original contrasting designs of OSPF and IS-IS. A little bit of history is required to understand the original context of the discussion and decisions that were made in relation to the two protocols.
Note
IS-IS is a link state protocol widely deployed in large-scale service provider networks. The standards specification for the protocol is contained in the ISO 10589.1
1. For more information on IS-IS, see Russ White and Alvaro Retana, IS-IS: Deployment in IP Networks, 1st edition. (Boston: Addison-Wesley, 2003).
IS-IS was originally designed to support the Open Systems Interconnect (OSI) protocol stack (the original seven-layer stack), which was designed around End Systems (ES) and Intermediate Systems (IS). Each ES was assigned an address based on the Layer 2 address of the device that was then advertised by the connected IS throughout the network—essentially host routing was the normal mode of operation for OSI-based networks, with aggregation only taking place at the flooding domain boundary. The actual systems involved were heavier weight computers (for that time), and hence had a good deal of processing performance and available memory, so the original designers of IS-IS were more concerned about designing a protocol that was flexible and extensible to new address types (in case the OSI networking protocols ever took on a new type of end host to route for, or Layer 2 addressing changed, etc.), as well as the ability to handle level 1 and level 2 flooding domains with a minimal number of packet types.
This general outlook led to a protocol that was focused on carrying information in TLVs; to add new information, a new TLV to carry the information could easily be added to the protocol. At the same time, all the information originated by a single device was carried in a single, corresponding, Link State Packet (LSP), which could be fragmented if the size of packet became larger than the Maximum Transmission Unit (MTU) of the network.
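The mechanics of TLV processing are easy to see in code. Below is a minimal sketch in Python (not an IS-IS implementation; the type numbers and values are illustrative) of the Type-Length-Value pattern: a parser walks the byte stream field by field, and a new information type is just a new type code, with no change to the parser's framing logic:

```python
import struct

def encode_tlv(tlv_type: int, value: bytes) -> bytes:
    """One-octet type, one-octet length, then the value itself."""
    return struct.pack("!BB", tlv_type, len(value)) + value

def decode_tlvs(data: bytes):
    """Walk a byte stream, yielding (type, value) pairs. A receiver can
    skip unknown types by hopping over 'length' bytes, which is what
    makes TLV-based protocols easy to extend."""
    offset = 0
    while offset < len(data):
        tlv_type, length = struct.unpack_from("!BB", data, offset)
        offset += 2
        yield tlv_type, data[offset:offset + length]
        offset += length

# Two hypothetical TLVs packed back to back in one LSP fragment.
stream = encode_tlv(1, b"area-id") + encode_tlv(137, b"router-name")
print(list(decode_tlvs(stream)))  # [(1, b'area-id'), (137, b'router-name')]
```

The cost is visible too: every field pays a two-byte header, and the receiver must parse dynamically rather than indexing into a fixed layout.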
OSPF, on the other hand, was designed with a slightly different set of goals in mind. In the time during which OSPF was designed, routers (in particular) had what would today be considered very low performance processors. As special purpose devices, routers simply were not equipped with the processing performance or the memory that any given ES might have—so while IS-IS counted on a relatively high-performance device handling forwarding, OSPF was developed around the concept of a relatively low-performance device handling forwarding.
To meet these goals, three specific points came to the forefront in OSPF’s design:
• Fixed length encoding. Rather than using TLVs to encode and carry information, OSPF focused on using fixed length encoding, which requires less resources both on the wire (the TLV header can be dispensed with) and in processing (fixed blocks of code can be used to process specific packet formats, rather than the longer, more complex blocks of code required to dynamically read and react to various TLVs carried within a single stream).
• Fixed packet types of a smaller size. Rather than each router originating a single “packet” that is fragmented, and the numerous headaches around synchronizing these fragments, OSPF focused on shorter packets that would (in theory) fit in a single MTU packet on the network. This made flooding and synchronization simpler.
• Aggregated reachability. In IP, Layer 3 addressing is untied from Layer 2 addressing, which means (in practice) that reachability is aggregated from the first hop. IP routers advertise subnets, which are an aggregate of a set of hosts, rather than each individual host. This means OSPF needed to handle multiaccess links and aggregation in a way that was more IP centric than IS-IS in the initial phases of the design.
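The fixed-length alternative looks quite different in code. The sketch below (in Python; the 12-byte layout loosely echoes a router LSA link description, but it is an illustration, not OSPF itself) shows why fixed formats are cheap to process: every record is the same size, so the parser steps through the buffer in constant strides with no per-field headers:

```python
import struct

# A hypothetical fixed-length record: 4-byte link ID, 4-byte link data,
# 1-byte type, 1-byte count, 2-byte metric -- always exactly 12 bytes.
LINK_FORMAT = "!4s4sBBH"
LINK_SIZE = struct.calcsize(LINK_FORMAT)  # 12

def parse_links(data: bytes):
    """Fixed-length records need no type/length framing: the parser just
    unpacks the buffer in constant-size strides."""
    for offset in range(0, len(data), LINK_SIZE):
        yield struct.unpack_from(LINK_FORMAT, data, offset)

record = struct.pack(LINK_FORMAT, b"\x0a\x00\x00\x01", b"\xff\xff\xff\x00", 3, 0, 10)
links = list(parse_links(record * 2))
print(len(links))  # 2
```

The tradeoff runs the other way from TLVs: the wire format and the parsing code are smaller and simpler, but carrying any new kind of information means defining a new record or packet format.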
As a result of these goals, OSPF was designed with a number of different LSA types, each one having a fixed format, and each one carrying a different type of information. Figure 7.1 illustrates the difference.
The network illustrated in Figure 7.1 consists of only two routers, each connected to a stub link, and both connected to a common broadcast (multipoint) link. Both OSPF and IS-IS elect a pseudonode (though it’s called a Designated Router [DR] in OSPF) to represent the broadcast link to reduce the number of adjacencies and routing protocol traffic flowing across the link; this is represented by Router B, which is smaller and gray. Treating each IS-IS TLV as a separate packet (or fragment of a single LSP), the two sets of packets generated by each protocol are set side-by-side. If you understand the concept of link state routing, the result isn’t very surprising—the same information must somehow be carried in both protocols. The main difference is in how the two protocols encode the information. IS-IS encodes the information in TLVs within fragments, while OSPF encodes the information in different types of packets. If each TLV in IS-IS represents a single LSA type in OSPF, the correlation is almost one-to-one.
OSPF traded off complexity in coding and transport against fixed length packets. This leads to the question: what advantage does using TLVs give IS-IS? The cost of the TLV is a larger packet size and more complex processing; what does IS-IS gain over OSPF in exchange for this increase in packet formatting complexity?
The most obvious answer is in the ability to quickly and easily add new types of information to IS-IS. For instance, when IPv6 was developed, both of these link state protocols needed to support the new IP format. IS-IS simply added a new set of TLVs, while OSPF required an entirely new protocol—OSPFv3. The downside of fixed length formats is immediately obvious. In general, IS-IS proliferates TLV types, while OSPF proliferates packet types.
Note
This is a bit simplified to illustrate a specific point; the TLV formatting is not the only reason why OSPFv3 was created rather than simply adding IPv6 to OSPFv2. For instance, OSPFv2 builds adjacencies, and carries LSAs, across IPv4, so the next hops in OSPFv2 are IPv4 addresses; this would cause some headaches when trying to deploy IPv6, which should naturally use IPv6 next hops. OSPF is more intimately tied to the underlying Layer 3 transport than IS-IS, which peers directly at Layer 2.
Is OSPF better, or IS-IS? There's no clear-cut answer to this question; a lot depends on the use to which the protocol is being put, the skills of the network operators, and the individual implementations. In the real world, the tradeoffs made years ago, based on the different environments each protocol was designed for, have been overcome by events; that the details hardly matter is apparent in the level of detail any debate about the two protocols must reach to actually find a difference worth mentioning.
Figure 7.2 illustrates two protocol stacks; which one is simpler?
If you’re like most network engineers, you’re likely to have one of two reactions to this illustration:
• The XYZ protocol, because there’s only one protocol in the stack—and one protocol is always simpler than many protocols interacting.
• The XYZ protocol stack, because layering is good. Aren’t transport protocols (such as TCP and IP) layered like this?
Either you believe that layers are good, because they've been drilled into the head of every network engineer since their first console experience, or that layers are a necessary evil, hard to remember when you're sitting down to take a certification test or college exam. Which is it?
It’s neither.
Which one is more complex actually depends on several factors, including:
• What the total functionality of the stack of protocols is. Changing the context of the illustration changes the perception of the problem; if someone said, “the XYZ protocol stack is a replacement for the IP component in the current transport stack, increasing the layers in that stack from four to six,” you’d immediately think “too complex.” If someone said, “the XYZ protocol replaces three layers of tunnels currently running in the network with a single tunneled protocol,” you’d probably immediately yell, “hurrah!”
• How well defined the interaction surfaces are between the protocols in the stack. Protocol stacks with well-defined layers make for well-defined interaction surfaces. Well-defined interaction surfaces are still leaky abstractions, but better definition tends to reduce the leaks. Well-defined interaction surfaces are still interaction surfaces that add complexity, but well-defined depth and scope both contribute to fewer cross-layer abstraction leaks, and make it easier to understand the interactions for modifications and troubleshooting.
Layering, then, can either be a good thing or a bad thing from a complexity standpoint. It all depends on how the layering impacts the following:
• State: Does the addition of a layer actually divide state in some meaningful way? Or are the additional layers simply spreading out state that often “acts together” across the layering boundary, making the interaction surface more complex?
• Speed: Does the additional layer reduce the speed at which one layer must operate, or separate two processes with completely different “operation tempos”?
• Surface: Does the additional layer produce a set of clean interaction points and minimize leaky abstractions? Are there just a few places where unintended consequences can come into play, or many?
Note
The concept of unintended consequences is closely tied to the concept of subsidiarity, which is covered in more detail in Chapter 10, “Programmable Network Complexity.”
One of the ways network engineers have dealt with the questions raised above about layering in protocol stacks is by building protocol stacks according to models. Two common models are the four-layer DoD model, and the seven-layer OSI model. While most network engineers are familiar with at least one of these models, a quick review might be in order.
Virtually anyone who has ever been through a networking class or studied for a network engineering certification is familiar with using the seven-layer model to describe the way networks work. The Connectionless Network Protocol (CLNP) and a routing protocol, IS-IS, were designed by the ISO to meet the requirements given within the seven-layer model. These protocols are still in wide use, particularly IS-IS, which has been modified to support routing in IP networks. Figure 7.3 illustrates the seven-layer model.
Each pair of layers, moving vertically through the model, interacts through an API. So to connect to a particular physical port, a piece of code at the data link layer would connect to the socket for that port. This allows the interaction between the various layers to be abstracted and standardized. A piece of software at the network layer doesn’t need to know how to deal with various sorts of physical interfaces, only how to get data to the data link layer software on the same system.
Each layer has a specific set of functions to perform:
• The physical layer (Layer 1) is responsible for getting the 0s and 1s modulated, or serialized, onto the physical link. Each link type has a different format for signaling a 0 or 1; the physical layer is responsible for translating 0s and 1s into these physical signals.
• The data link layer is responsible for making certain that transmitted information is sent to the right computer on the other side of the link. Each device has a different data link (Layer 2) address that can be used to send traffic to that specific device. The data link layer assumes that each frame within a flow of information is separate from all other packets within that same flow and only provides communication for devices that are connected through a single physical link.
• The network layer is responsible for transporting data between systems that are not connected through a single physical link. The network layer, then, provides network-wide (or Layer 3) addresses, rather than link local addresses, and also provides some means for discovering the set of devices and links that must be crossed to reach these destinations.
• The transport layer (Layer 4) is responsible for the transparent transfer of data between different devices. Transport layer protocols can either be “reliable,” which means the transport layer will retransmit data lost at some lower layer, or “unreliable,” which means that data lost at lower layers must be retransmitted by some higher-layer application.
• The session layer (Layer 5) doesn’t really transport data, but manages the connections between applications running on two different computers. The session layer makes certain that the type of data, the form of the data, and the reliability of the data stream are all exposed and accounted for.
• The presentation layer (Layer 6) formats data in a way that the application running on the two devices can understand and process. Encryption, flow control, and any other manipulation of data required to provide an interface between the application and the network happen here. Applications interact with the presentation layer through sockets.
• The application layer (Layer 7) provides the interface between the user and the application, which in turn interacts with the network through the presentation layer.
Each layer in the model provides the information the layer below it is carrying; for instance, Layer 3 provides the bits Layer 2 encapsulates and transmits using Layer 1. This leads to the following observation: not only can the interaction between the layers be described in precise terms within the seven-layer model, the interaction between parallel layers on multiple computers can be described precisely. The physical layer on the first device can be said to communicate with the physical layer on the second device, the data link layer on the first device with the data link layer on the second device, and so on. Just as interactions between two layers on a device are handled through sockets, interactions between parallel layers on different devices are handled through network protocols:
• Ethernet describes the signaling of 0s and 1s onto a physical piece of wire, a format for starting and stopping a frame of data, and a means of addressing a single device among all the devices connected to a single wire. Ethernet, then, falls within both Layer 1 and Layer 2 in the OSI model.
• IP describes the formatting of data into packets, and the addressing and other means necessary to send packets across multiple Layer 2 links to reach a device that is several hops away. IP, then, falls within Layer 3 of the OSI model.
• TCP describes session setup and maintenance, data retransmission, and interaction with applications. TCP, then, falls within the transport and session layers of the OSI model.
This was illustrated in Figure 7.3, where each “layer-wise” interaction is labeled using the name used to describe the blocks of information transferred. For instance, segments are transferred from the transport layer on one device to another, while packets are transferred from the network layer on one device to another.
One of the reasons why engineers have so much difficulty fitting the IP transport stack into the seven-layer model is that it was developed with a different model of network communications in mind. Instead of a seven-layer model, the IP transport stack was designed around a four-layer model, roughly described in RFC 1122 and illustrated in Figure 7.4.
In this model:
• The link layer is roughly responsible for the same functions as the physical and data link layers in the OSI model—controlling the use of physical links, link-local addressing, and carrying frames as individual bits across an individual physical link.
• The Internet layer is roughly responsible for the same things as the Network layer in the OSI model—providing addressing and reachability across multiple physical links and providing a single packet format and interface regardless of the actual physical link type.
• The transport layer is responsible for building and maintaining sessions between communicating devices and providing a common transparent data transmission mechanism for streams or blocks of data. Flow control and reliable transport may also be implemented in this layer, as in the case of TCP.
• The application layer is the interface between the user and the network resources, or specific applications that use and provide data to other devices attached to the network.
In this model, Ethernet fits wholly within the link layer; IP and all routing protocols fit in the Internet layer; and TCP and UDP fit within the transport layer. Because of this neatness, this is a cleaner model for understanding IP as it's deployed today, although it doesn't provide the detailed structure of splitting up the various levels of signaling that might be useful in a more research-oriented environment.
The seven- and four-layer models rely on the concept that as you move up the stack from the physical to the application, each layer adds some specific function, or a set of related functions. This functionality is generally tied to a component of the network, such as a single link, an end-to-end path, or application to application communication. By grouping functionality into layers, the complexity of each protocol in the stack can be minimized, and the interaction surface between each protocol or each layer of protocols within the stack can be defined and managed.
To understand these concepts a little more fully, it's useful to look at one final model for protocol stacks in networks. While this model isn't as popular or used as widely as the seven- and four-layer models, it more clearly illustrates the tradeoff between complexity and layering.
If you examine the actual function of each layer in the protocol stacks above, you’ll find many similarities between them. For instance, the Ethernet data link layer provides transport and multiplexing across a single link, and IP provides transport and multiplexing across a multihop path. This leads to the following observation: There are really only four functions that any data carrying protocol can serve: transport, multiplexing, error correction, and flow control.2 There are two natural groupings within these four functions: transport and multiplexing, error and flow control. So most protocols fall into doing one of two things:
2. John Day, Patterns in Network Architecture: A Return to Fundamentals (Upper Saddle River, N.J.; London: Prentice Hall, 2008).
• The protocol provides transport, including some form of translation from one data format to another, and multiplexing, the ability of the protocol to keep data from different hosts and applications separate.
• The protocol provides error control, either through the capability to correct small errors or to retransmit lost or corrupted data, and flow control, which prevents undue data loss because of a mismatch between the network’s capability to deliver data and the application’s capability to generate data.
From this perspective, Ethernet provides transport services and flow control, so it is a mixed bag concentrated on a single link, port to port (or tunnel endpoint to tunnel endpoint) within a network. IP is a multihop protocol (a protocol that spans more than one physical link) providing transport services, whereas TCP is a multihop protocol that uses IP’s transport mechanisms and provides error correction and flow control. Figure 7.5 illustrates the iterative model.
Each layer in this model groups the information required to manage a single set of parameters within a given scope into a single protocol.
What does this little excursion into protocol design and layering tell us about network complexity, particularly in the area of design? First, layering is a time-tested, proven way to separate complexity from complexity by dividing functions across an API boundary. Second, the comparison of the three models shows that matching the layers to what they do, rather than where they are, provides a cleaner and easier-to-handle model of operations. While the four- and seven-layer models work, they neither really describe what the protocols at each layer are doing, nor do they really suggest a way to insert new protocols into the stack in the future. The iterative model, on the other hand, provides both of these.
Protocol stacks aren't just self-contained systems, however; they interact with the larger network environment through application performance, manageability, and design. This section focuses on the tradeoff between network design and protocol complexity a little more deeply, to give network engineers a better feel for the tradeoffs in these areas. Specifically, how does protocol complexity trade off against design complexity?
As the protocol takes on more complexity, some elements of design become less complex. On the other hand, as the protocol becomes more complex, troubleshooting and managing the protocol becomes more complex as well. Each action has an equal and opposite reaction, creating a chain of action and reaction as the complexity levels of each component in the overall system rise and fall. Engineers need to be aware that the complexity doesn’t just “go away,” once they’ve “thrown it over the cubicle wall.” Some other system must pick that complexity up and manage it.
The first example of this tradeoff between protocol complexity and network complexity is a concept that’s already been discussed in Chapter 5, “Design Complexity.” This section will dig a little deeper in the protocol complexity pieces of fast reroute. Figure 7.6 will be used as an illustration here.
Assume that this network starts with the state:
• Router B’s shortest path to 2001:db8:0:1::/64 is through A.
• Router C’s shortest path to 2001:db8:0:1::/64 is through B.
• Router D’s shortest path to 2001:db8:0:1::/64 is through C.
• Router E’s shortest path to 2001:db8:0:1::/64 is through A.
• All routers are running a link state routing protocol (either OSPF or IS-IS).
When the [B,C] link fails, what will be the result? To answer this, consider:
• Routers B and C will be notified of the link failure immediately, and local processes will remove the impacted destinations from the local routing table.
• Routers B and C, because they’ve been notified of the failure first, will calculate a new tree and switch to the new routes first.
• Routers D and A will be notified of the failure second, calculate a new tree, and switch to the new routes.
• Router E will be notified of the failure last, and hence will calculate and install any new necessary routing information last.
Note
In link state protocols, flooding is a separate process from calculating the SPT; for convenience, the description above considers them both at the same time.
Combining the original state of the network with the order of convergence when the [B,C] link fails, it's obvious where the microloop occurs. When Router C discovers the failure, it quickly recomputes the SPT and discovers that the path through Router D is the new shortest (loop free) path. While Router C is forwarding through Router D, Router D is still discovering the failure and recalculating a new shortest path tree. During the time differential between Router C's recalculation and Router D's, traffic destined to 2001:db8:0:1::/64 will be looped between the two routers.
Why not set a timer so Router C waits until Router D has recalculated to install the new routes? This simply won't work. Every router in the network must react the same way to received information; the entire point of a link state protocol is to ensure that every device participating in the control plane has a single view of the state of the network, and acts in the same way. This is one of the foundations of the determinism of link state control planes like OSPF and IS-IS. Setting up a timer during which Router C must wait to converge means setting the same timer on Router D; all this solution does is slow the total network convergence down, rather than actually resolving the microloop.
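The window described above can be illustrated with a toy forwarding table. The next-hop entries here are hypothetical, and real convergence involves flooding, SPF, and FIB installation delays not modeled in this sketch:

```python
# A toy model of the microloop window described above. The FIB entries are
# hypothetical next hops for 2001:db8:0:1::/64 around the [B,C] failure.

fib_before = {"C": "B", "D": "C"}   # pre-failure next hops
fib_after = {"C": "D"}              # Router C has already reconverged

# Window: C already uses its new next hop, D still uses its stale one.
fib_window = {"C": fib_after["C"], "D": fib_before["D"]}

hop, path = "C", []
for _ in range(4):                  # follow a packet for a few hops
    path.append(hop)
    hop = fib_window[hop]
print(path)  # → ['C', 'D', 'C', 'D']: traffic bounces until D reconverges
```

The packet simply bounces between the two routers until Router D installs its own post-failure path, at which point the window closes.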
Note
It is possible to force the routers in a link state control plane to order the installation of the prefixes to prevent microloops; however, it's not as simple as using timers. Instead, each router must calculate its distance from the network change, and calculate the amount of time it must wait before it can safely install new shortest paths to each destination in the local routing table. This process does prevent microloops, but at a cost in terms of convergence speed. RFC 6976 describes the ordered Forwarding Information Base (FIB) solution, along with a completion-message system that can alleviate some of the convergence speed impact.
Why not just force all the routers in a link state network to recalculate at the same time? Surely there must be a way to synchronize all the clocks across all the devices participating in the control plane so they all begin calculating at the same time, or they all install their routes at the same time. But how closely can you get all the timers on all the devices across thousands of distributed nodes synchronized? The answer is: not down to the microseconds. Even if you could synchronize all the clocks to this level of precision, it would be almost impossible to ensure every router installs the route at the same time after the calculation. Some routers have distributed forwarding planes, while others unify forwarding with route calculation. Among routers that have distributed forwarding planes, each one is going to have different paths between the control plane and the forwarding hardware, and each of those paths is going to take a different amount of time to install the route. Even among routers with identical distributed systems, the processor load on each router will vary moment to moment, which again impacts the amount of time it takes to install a route once it's been calculated.
Several answers to this problem are considered in the following sections.
Loop Free Alternates (LFAs) were probably first outlined academically in a paper by J. J. Garcia-Luna-Aceves, the same paper that became the foundation for Cisco's EIGRP.3 The concept of an LFA relies on simple geometry, as shown in Figure 7.7.
3. J. J. Garcia-Luna-Aceves, “Loop-Free Routing Using Diffusing Computations,” IEEE/ACM Transactions on Networking 1, no. 1 (February 1993): 130–141.
In the network illustrated in Figure 7.7, there are three paths between Router A and 2001:db8:0:1::/64:
• [A,B,E] with a metric of 4
• [A,C,E] with a metric of 5
• [A,D,E] with a metric of 6
To understand LFAs, three more metrics need to be noted:
• [B,E] with a metric of 2
• [C,E] with a metric of 3
• [D,E] with a metric of 4
The concept of an LFA comes from this simple observation: the cost of any path that loops “through me” cannot be less than “my cost” to reach that same destination. To put this in other terms—if a path is a loop “through me,” the cost advertised by the neighbor for the route must be greater than or equal to the cost of the local best path to that same destination.
In terms of Figure 7.7, from the perspective of Router A:
• If the cost of a path to reach 2001:db8:0:1::/64 from Router A is greater than 4, then it might be a loop.
• If the cost of a path to reach 2001:db8:0:1::/64 is less than 4, then it cannot be a loop passing back through Router A itself.
With this information in hand, Router A can now examine the cost to reach 2001:db8:0:1::/64 from the perspective of each of its connected neighbors to determine if their path to reach this destination is a loop back through Router A itself or not. Examining each of the paths available:
• (A,B,E): This is the best (lowest cost) path, so it does not need to be examined.
• (A,C,E): The cost at Router C (Router A’s neighbor) is 3, and the best path at Router A is 4. Because Router C’s cost is less than Router A’s cost, this path cannot loop back through Router A itself. Hence, this path is loop free (from Router A’s perspective), and is therefore a valid loop free alternate path.
• (A,D,E): The cost at Router D is 4, and the best path at Router A is 4. Because these two costs are equal, it is possible/probable that traffic forwarded to 2001:db8:0:1::/64 could be forwarded back to Router A during certain topology change events (this is another instance of a microloop). Because of this, EIGRP would declare this path as being a looped path, while link state protocols (OSPF and IS-IS) would determine this is a viable loop free alternate path to the destination.
Note
Just to be complete, the metric of the best path at Router A is called the Feasible Distance in EIGRP, and the metric at Routers B, C, and D is called the Reported Distance at Router A—because this is the cost (or distance or metric) they’ve reported to Router A.
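The feasibility test from the example above can be sketched in a few lines. The metrics come from the Figure 7.7 example in the text; the function name and the strict/non-strict flag are illustrative assumptions rather than any vendor's actual implementation:

```python
# Hedged sketch of the loop-free alternate test from the example above.
# Costs are from Figure 7.7; names are illustrative, not from a real stack.

FEASIBLE_DISTANCE = 4  # Router A's best-path metric to 2001:db8:0:1::/64

# Reported distance: each neighbor's own metric to the destination.
reported = {"B": 2, "C": 3, "D": 4}

def is_loop_free(reported_distance, feasible_distance, strict=True):
    """EIGRP's feasibility condition requires the reported distance to be
    strictly less than the feasible distance; per the text, a link state
    implementation may accept equality (strict=False)."""
    if strict:
        return reported_distance < feasible_distance
    return reported_distance <= feasible_distance

for neighbor, rd in sorted(reported.items()):
    print(neighbor, is_loop_free(rd, FEASIBLE_DISTANCE))
# B True  (the primary path itself)
# C True  (a valid loop free alternate)
# D False (EIGRP treats the equal-cost path as potentially looped)
```

Flipping `strict` to `False` models the link state behavior described for the (A,D,E) path, where an equal reported distance is still accepted as a viable alternate.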
What are the tradeoffs, in terms of complexity, when using LFAs to provide fast reroute, from a protocol perspective?
• State (carried): There is no additional state carried in the control plane protocol itself to provide the loop free alternate. For link state protocols, the cost to reach a destination at a neighbor can be computed from the link state database by simply running SPF from the neighbor’s perspective. For EIGRP, the cost at the neighbor is simply the metric contained in the original route advertisement (without the cost of the connected links added in).
• State (local): There is some additional state within each router, or each device participating in the control plane. For a link state protocol, SPF must be run not only for a local view of the network, but also once for each neighbor, to obtain each neighbor’s cost to any given destination. The backup path must also be somehow stored once calculated, so the forwarding plane can quickly switch over to it if the primary path fails.
• Speed: In theory, fast reroute mechanisms reduce the global speed at which the control plane must react to changes in the topology by increasing the speed at which they can react locally.
• Surface: As there are no changes to the protocol “on the wire,” there is little change in the surface between the control plane and any other system when using LFAs for fast reroute.
From a protocol perspective, the negative results of the tradeoff in complexity are minimal, while the positive results are fairly large. From a design perspective (covered more fully in Chapter 6, “Managing Design Complexity”), LFAs don’t provide coverage for many commonly used topologies. For the network engineer determining whether to deploy LFAs or not, two things must be considered:
• Which parts of the network will LFAs cover, and which will they not cover?
• Is it worth the additional complexity of deploying LFAs to cover the parts of the network they will cover? A specific point that needs to be considered here is that if there are parts of the network that will not be covered, deploying LFAs may not improve the delay/jitter characteristics of the end-to-end path in a measurable way (or rather, deploying LFAs won't allow the operator to claim a fixed, high-speed convergence time).
The crucial point is to consider the tradeoff between the business requirements for fast convergence, network coverage, and the added complexity from protocol, configuration, troubleshooting, and management perspectives.
The primary topology LFAs cannot provide fast reroute for is a ring, as illustrated in Figure 7.8.
In this network, Router A has two loop free paths to 2001:db8:0:1::/64, but it will mark the path through Router B as a potential loop when considering LFAs. How can Router A use the path [A,B,F,E]? The problem is that during transitions in the network topology, any traffic Router A sends to Router B, destined to a host on 2001:db8:0:1::/64, might be looped back by Router B (see Chapter 5, “Design Complexity,” for more details). How can this problem be solved? If Router A can figure out how to tunnel the traffic to a router that always points to Router E to reach 2001:db8:0:1::/64, then it can forward traffic over this tunnel during topology changes and still be certain the traffic will not loop (will reach the correct destination).
How can Router A find out about potential tunnel endpoints that meet this criteria? NotVia is one solution. Assume that the network administrator determines the link [D,E] needs to be protected from failure. In a rather simplified form:
• Router E is configured with a special IP address. This address is called E notvia D for reference throughout this explanation.
• E notvia D is advertised by Router E to every neighbor other than Router D. In this case, E notvia D is advertised only to Router F.
• Router A receives this advertisement only from Router B, giving it a route to Router E no matter what the state of the [A,D,E] path is.
• When Router A calculates the path to 2001:db8:0:1::/64, it first finds the best (lowest cost) path along [A,D,E], and installs this in the local table.
• Router A then searches for an alternate path to this destination. Finding Router B is in the path, and there is, in fact, a route E notvia D, it installs the notvia route as a backup tunnel path to the destination.
• If the path [A,D,E] fails, Router A switches traffic destined to 2001:db8:0:1::/64 to the tunneled path. Traffic is encapsulated in a header that follows [A,B,F,E]. Router E removes this outer header and forwards based on the information inside the packet. Because the inner header in the tunneled packet is destined to a host on 2001:db8:0:1::/64, the packet is forwarded out the directly connected interface.
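The heart of NotVia is that the backup path to E notvia D is simply a shortest path computed with the protected [D,E] link removed. A minimal sketch, assuming the ring topology of Figure 7.8 with unit link costs (the actual metrics are not given in the text):

```python
# Sketch of the NotVia idea: the backup path to E's "E notvia D" address
# is just the shortest path computed with the protected [D,E] link removed.
# Topology follows the ring in Figure 7.8; unit link costs are an assumption.
import heapq

def shortest_path(adj, src, dst, excluded=frozenset()):
    """Plain Dijkstra; 'excluded' holds links (as frozensets) to skip."""
    dist, prev = {src: 0}, {}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, cost in adj[u]:
            if frozenset((u, v)) in excluded:
                continue
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    return [src] + path[::-1]

ring = {
    "A": [("B", 1), ("D", 1)],
    "B": [("A", 1), ("F", 1)],
    "F": [("B", 1), ("E", 1)],
    "D": [("A", 1), ("E", 1)],
    "E": [("D", 1), ("F", 1)],
}

# Path to the notvia address: SPF with the protected [D,E] link excluded.
print(shortest_path(ring, "A", "E", excluded={frozenset(("D", "E"))}))
# → ['A', 'B', 'F', 'E']
```

Excluding the protected link yields the [A,B,F,E] backup path, while the unconstrained computation still returns the primary [A,D,E] path.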
Note
This sequence assumes that the best path to Router E from Router A is along [A,B,F,E]. Because this may not be the case, NotVia would calculate to Router F instead. How this is done is beyond the scope of this simple explanation of the concept.
How does NotVia fare from a complexity perspective?
• State (carried): NotVia requires an additional IP address for each protected link or node in the network. This may (depending on how many links or devices are protected) be a large number of additional IP addresses injected into the control plane—and these addresses must be somehow marked in the protocol as notvia addresses, so they’re not used for normal forwarding. State is, therefore, increased with NotVia. In fact, this additional state is the primary reason NotVia was rejected as an Internet Standard.
• State (local): Each device participating in the control plane will need to keep the additional notvia addresses, and calculate their local SPT with these links included to discover alternate routes. The calculation time is likely minimal, but there is some additional state.
• Speed: In theory, fast reroute mechanisms reduce the global speed at which the control plane must react to changes in the topology by increasing the speed at which they can react locally.
• Surface: Although there are protocol changes, there is little change in the interaction surfaces between the control plane and the other network systems it interacts with. NotVia does deepen the interaction surfaces between routers, however, in two ways. First, there is additional state that must be carried, computed, and relied on in the control plane. Second, there is the addition of an “open tunnel” on any device in the control plane hosting a NotVia address.
NotVia does add some state to the control plane, and hence does increase control plane complexity.
An alternative to NotVia is for Router A, in Figure 7.8, to use some other mechanism to calculate a remote next hop that will reach 2001:db8:0:1::/64 even if its primary path through Router D has failed, and find some alternate way to tunnel packets to this intermediate router. Remote LFAs provide just such a solution. To calculate a remote LFA:
• Router A calculates an SPT from its neighbor's neighbor's perspective. In this case, Router A would calculate the best path to 2001:db8:0:1::/64 from the perspective of Router F.
• Finding Router F does, in fact, have a loop free path to the destination, Router A can then build a tunnel to Router F, and install this tunnel as a backup to the primary route through Router D.
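The "neighbor's neighbor" check above can be sketched as a simple distance comparison. Note this illustrates only the basic loop-free criterion relative to Router A, not the full P-space/Q-space computation standardized for remote LFAs; the costs are assumed unit metrics on the Figure 7.8 ring, and the candidate "X" is purely hypothetical:

```python
# Hedged sketch of the loop-free check Router A might apply to a remote
# candidate such as Router F. Distances are illustrative unit costs on the
# Figure 7.8 ring; "X" is a purely hypothetical failing candidate.

dist = {
    ("F", "dest"): 1,  # F reaches 2001:db8:0:1::/64 directly via Router E
    ("F", "A"): 2,     # F back toward Router A, via Router B
    ("X", "dest"): 5,  # hypothetical candidate whose best path is long
    ("X", "A"): 1,
    ("A", "dest"): 2,  # Router A's own (pre-failure) best-path cost
}

def is_remote_lfa(candidate):
    """A candidate is usable only if its own shortest path to the
    destination is shorter than any path hairpinning back through A."""
    return dist[(candidate, "dest")] < dist[(candidate, "A")] + dist[("A", "dest")]

print(is_remote_lfa("F"))  # → True: tunneling to F is safe
print(is_remote_lfa("X"))  # → False: X might forward back through A
```

If the candidate passes, Router A can install a tunnel to it as the backup path; if not, traffic tunneled there could hairpin back through Router A during a topology change.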
The main question, in this solution, is what form of tunneling should be used to build these alternate paths. The most common answer to this question is MPLS, as it is lightweight, and has all the signaling necessary to dynamically build tunnels. Given MPLS signaling of some type is already running in the network for other reasons, adding Remote LFAs is a fairly trivial exercise from a network management perspective. Dynamically created tunnels can add some excitement to the troubleshooting process, of course; it can take some time and careful work to sort out why specific flows are traveling over specific links in the network with dynamic tunnels, because there’s no connection between the configurations on individual devices and the actual tunnels created.
Again, it’s worth examining how Remote LFAs compare to the complexity model used here.
• State (carried): Remote LFAs don’t add any new information to the routing protocol specifically, but they do require some way to build backup tunnels through the network. If some form of dynamic tunneling system is already running on the network (for instance, to support MPLS virtual private networks or traffic engineering/MPLS-TE), then this additional state is very small. If some form of dynamic tunnel signaling must be deployed to support Remote LFAs, then the additional control plane state—and increase in complexity—could be daunting.
• State (local): Each device participating in the control plane will need to perform an additional set of SPF calculations, and to store information about a backup tunnel. Of course, the state of the backup tunnel must be maintained, as well, so this could well be a large amount of state (and hence, a large amount of complexity).
• Speed: In theory, fast reroute mechanisms reduce the speed at which the control plane must react globally to changes in the topology by increasing the speed at which individual devices can react locally.
• Surface: Remote LFAs add at least one new system to the network: some mechanism to dynamically build and manage an overlay of alternate path tunnels. This adds several new interaction surfaces, such as the interface between the control plane and the tunnel signaling system, and the interface between the physical topology and the tunneled overlay. If a dynamic tunneling mechanism has been deployed for some other reason, then these additional interaction surfaces are already in place, and complexity isn’t increased in the network. If dynamic tunneling is being deployed to support Remote LFAs, however, their deployment will increase the number of interaction surfaces in the network.
Note
This list of tradeoffs assumes that Label Distribution Protocol (LDP) is, in fact, running on all the nodes required to build backup tunnels, so the cost to add Remote LFAs is primarily in the amount of state carried in the network, as well as additional tunnels, endpoints, etc. In fact, if deploying Remote LFAs requires deploying LDP to a wider set of nodes in the network—a distinct possibility in most network designs—the increased network performance would need to be greater, or the business drivers more insistent, to justify deploying LDP on the additional nodes. The cost of deploying protocols and solutions onto additional nodes in the network should always be included when considering the complexity tradeoffs.
Early in the era of large-scale networking, EIGRP gained a reputation as being a routing protocol you could “throw onto any topology, without a lot of design effort, and will just work.” There is, in fact, a good deal of truth to this belief—in fact, there’s a bit too much truth to this statement. Soon enough, very large unplanned networks began failing in a big way while running EIGRP—leading to the opposite conclusion, that EIGRP is a really horrible control plane. In a way, EIGRP is a victim of its own early success in supporting large complex networks with little thought to actual design.
Why did EIGRP hold up so well in large-scale networks with little design effort? A short review of EIGRP operation might be helpful in explaining, using the topology in Figure 7.9.
If Router A loses its link to 2001:db8:0:1::/64, what happens?
1. Router A will first examine its local information to determine if there is an LFA (a feasible successor, or FS, in EIGRP) for this destination.
2. Finding none, it will mark the route active (which means EIGRP is actively looking for an alternate route to this destination), and send a query to each of its neighbors, in this case Router D.
3. Router D, on receiving this query, will examine its local tables and discover Router A is the only path available to 2001:db8:0:1::/64. Router D will mark this route active, and send a query to Router E.
4. Router E will follow Router D’s example, sending a query to Router F.
5. Router F will follow Router E’s example, sending a query to Router G.
6. Router G will find it has no neighbors to query, so it will mark 2001:db8:0:1::/64 unreachable, and send a reply to Router F stating it has no alternate path to the destination.
7. Router F, on receiving this reply, will mark the route unreachable, and send a reply to Router E stating it has no alternate path to this destination.
8. This chain of replies will continue until Router A receives a reply from Router D. At this point, the route is removed from Router A’s routing table, and Router A sends a subsequent update with an unreachable metric for 2001:db8:0:1::/64. This removes the destination from the other routers’ tables.
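The eight steps above can be sketched as a recursive diffusion down the chain of routers. This is a toy model for illustration only: the single-path chain, router names, and log messages are assumptions, not EIGRP's actual packet formats or DUAL state machine.

```python
def eigrp_query(chain, idx, log):
    """Diffuse a query down a single-path chain of routers: each router
    marks the route active, queries its remaining neighbor, and replies
    once that neighbor has replied. Returns True if any router found an
    alternate path (never, in this topology)."""
    router = chain[idx]
    if idx + 1 == len(chain):
        # Step 6: the last router (G) has no neighbors left to query.
        log.append(f"{router}: no neighbors to query, mark unreachable")
        return False
    log.append(f"{router}: route active, query -> {chain[idx + 1]}")
    found = eigrp_query(chain, idx + 1, log)
    log.append(f"{router}: reply from {chain[idx + 1]}, no alternate path")
    return found

log = []
found = eigrp_query(["A", "D", "E", "F", "G"], 0, log)
# found is False: Router A removes the route and advertises it unreachable.
```

Note how the queries run out to the edge before any reply returns: the reply chain unwinds in reverse order, which is exactly the serialized behavior that prevents microloops, as discussed below.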
In simple networks like the one shown in Figure 7.9, this entire query process seems redundant—and even “not very useful.” In reality, the entire query mechanism in EIGRP is designed to find alternate paths previously marked as loops, or not reported because of split horizon. As such, the query process in EIGRP acts much like the mechanisms used in a link state protocol to discover a remote LFA—only it’s built into the protocol as a part of normal convergence, rather than added on as a fast reroute mechanism.
There are several interesting points about this process worth noting:
• The amount of state being carried through this process is minimal. As EIGRP is a distance vector protocol, no topology information is carried, only reachability information.
• Through this diffusing update process, EIGRP spreads the load of finding any available alternate paths across every router within the failure domain. This load spreading mechanism makes EIGRP extremely robust in large-scale environments (half a million routes or more is possible, with hundreds of neighbors, especially in hub-and-spoke topologies).
• The process of discovering any alternate routes is similar to running SPF on the tree itself, in real time. Because this is performed in a serialized way, no microloops are ever formed in EIGRP. Packets will be dropped during convergence, but never looped.
So why does EIGRP scale so well in relatively design free networks? Because the state carried in the control plane is minimal compared to most other protocols, and because the Diffusing Update Algorithm (DUAL) does such a good job at spreading the convergence load through the network.
Why, then, do EIGRP networks fail? EIGRP can encounter several specific network situations that will cause a convergence failure of some type, including:
• Extremely long “query tails.” An EIGRP query stops wherever there is no local knowledge of the destination contained in the query itself—this generally means “the edge of the network,” or someplace where reachability information is aggregated. If there is no aggregation configured in the network, each and every query will be processed by every router in the network. If a single link fails that causes thousands of destinations to be dropped off the network, the distributed convergence process that makes EIGRP robust actually works against the protocol, causing a lot more work than is necessary to bring the network to convergence.
• High numbers of parallel paths. Because each EIGRP router sends a query to each of its neighbors, a high number of parallel paths causes a large number of queries to be transmitted through the network. This works the EIGRP state machine very hard, again causing DUAL to work against its better attributes.
• Network devices with low amounts of memory or processing performance paired with devices that have big processors and lots of memory. A mismatched set of devices in the network means one router can quickly send thousands of queries to a neighbor that simply cannot handle the processing load. Again, the DUAL process begins to work against itself in these types of situations.
The most obvious result of any of these situations is a stuck in active route. As each router in the EIGRP network transmits a query, it sets a timer; the router expects to receive an answer for this query within this period of time. If no reply is received within this time period, the route is declared stuck in active, and the adjacency with the neighbor for which no reply was forthcoming will be reset.
As you can imagine, the last thing you want to do when the protocol is already under the stress of a large-scale convergence event is to reset the neighbor adjacencies, causing another round of queries, and hence more stress in the network. Why was the protocol designed to work this way? Because, in effect, once the timer has fired the control plane has stepped outside the finite state machine defined by DUAL. There is no real way to resolve the problem other than to reset the state machine.
For years, many engineers simply increased the stuck in active timer when they would see stuck in active routes in their network. This is, however, precisely the wrong solution to the problem. If you view the stuck in active timer as the amount of time you’re willing to allow the network to remain unconverged (and hence dropping packets to a destination that should, actually, be reachable), you can immediately see the bad side effects of increasing this timer. Instead, the network and protocol both need to work to resolve this situation.
Real field experience with the protocol in large-scale environments revealed these problems. In response, two lines of action were taken:
• The perception of EIGRP as a protocol that would run over just about anything you could configure it on was changed to a more balanced view. EIGRP will, in fact, tolerate a wide variety of topologies and network conditions. However, there still needs to be thought put into some specific design parameters, such as the size of a failure domain, the quality of the equipment deployed, and other factors. In the realm of complexity, this is effectively throwing some of the complexity being handled by the protocol “back over the cubicle wall,” to be handled by the network designer. There is an important lesson in balancing the location of complexity here.
• EIGRP’s stuck in active process was modified to allow the protocol to more gracefully remain inside the DUAL state machine for longer while decreasing the impact on the actual network operation. Specifically, the expiration of the stuck in active timer caused a new query to be transmitted through the query chain, rather than resetting the neighbor adjacency. This allowed routers with smaller processors and memory pools the time they needed to process large groups of queries, or a local adjacency problem to be sorted out without impacting the network as a whole. In complexity terms, this added set of packets and timers increased the complexity of the protocol to resolve some of the problems being seen in actual deployments. Here, complexity in deployment was traded off against protocol complexity, as the protocol complexity didn’t appear to be as difficult to manage as redesigning every EIGRP network in the world wholesale.
The development, and modification, of EIGRP through years of experience shows how complexity tradeoffs must be handled in the real world. Sometimes it’s best to move the complexity someplace else, sometimes it’s best to keep the complexity in the protocol. It all depends on the specific situation the engineer faces.
The intersection of protocol, design, and system complexity is itself a complex topic. This chapter really only scratches the surface of this broad and interesting area of investigation. However, the examples and outlined thoughts here should provide you with a solid core of working knowledge, a set of tools that will help you at least recognize the tradeoffs in any given situation, and to think twice about “tossing complexity over the cubicle wall.”
Every happy family is happy in much the same way, but every unhappy one is miserable in completely unique ways.
To apply this to the network world with just a slight change—“every network that succeeds does so in much the same way as every other network in the world, but every network that fails does so in a completely unique way.” Turning this around, this means one of the best ways to learn network design is to work on failures. Every failed network design an engineer touches teaches some lesson in design, as well.
But to really learn network engineering, a little theory of failure is called for—an understanding of what might be called “failure theory” provides a context, or a framework, into which an engineer can not only place failures, but also new ideas. New (and old) ideas in protocols, network devices, and networks as a system can be effectively evaluated for potential failures by knowing what sorts of failure modes to look for—and then actually looking. What tends to happen in network engineering is finding a problem that needs to be solved, and then solving it—without thought for the complexity tradeoffs, and without considering the failure modes that might result in the protocol or network.
This entire concept of potential failure modes works alongside the concept of unintended consequences, and the unreachable space at the bottom left corner of the complexity curve (or the unfillable triangle of the CAP theorem, discussed way back in Chapter 1, “Defining Complexity”).
This chapter will discuss two major reasons why networks fail from a theoretical perspective:
• Positive feedback loops
• Shared fate
These categories might seem broad, but this chapter will flesh them out with a number of examples that should help you understand what to look for in each case. These failure modes will also be examined to see how they interact with other commonly asserted causes of network failure, such as the speed of a finely tuned control plane, or redistribution.
Feedback loops are useful for a lot of different things—for instance, in a phase-locked loop, the feedback of the circuit is carefully controlled to produce a constant frequency waveform or carrier. Phase-locked loops form the foundation, in fact, of almost all modern radio systems. Another example of a feedback loop is the various types of mechanisms controlling the air and fuel flow into internal combustion and jet propulsion engines. Without these feedback loops, these types of engines can only run at a very minimal level, and at a somewhat low efficiency.
There are two kinds of feedback: negative and positive. Figure 8.1 illustrates a negative feedback loop using a very simple oscillating signal style of explanation.
In this figure, there are two devices; you don’t need to understand how the components illustrated here actually work to understand the concept of a feedback loop:
• Add: This device simply takes the two input signals and combines them to create a single output signal.
• Tap: This device simply replicates the signal presented to one input, does any programmed modification, and outputs the result on its second output interface.
In this case, the TAP device is configured to invert the signal and reduce its strength (amplitude in more precise terms) a bit. What happens in this illustration is this:
1. The section of the wave marked A passes through the ADD device.
2. At first, there is no signal on the second input, so the wave passes through the ADD device without change.
3. The wave continues on and passes into the TAP device.
4. At the TAP device, a small replica of the wave is made and inverted (so the positive peaks are reversed with the negative peaks), and this little replica of the signal is passed back along the path to the second input of the ADD device.
5. This signal, being fed back from the TAP device, is added to the original signal by the ADD device.
6. Because this signal is inverted, the addition of the two waves has the effect of reducing the strength of the output of the ADD device. This process is much like taking a large number, multiplying it by some very small number (say 0.01), inverting the number (changing it from a positive sign to a negative), and adding it back to the original number.
7. This has the effect of causing the output level of the ADD device to decrease.
8. This decreased signal is then fed back into the TAP device, which takes the same proportion of the received signal, inverts it, and feeds it back to the ADD device.
The result, as you can see from the output signal on the right side of the illustration, is a signal that is constantly decreasing in strength. It’s tempting to think this situation will stabilize, but it never really will—leave the circuit running long enough, and the signal exiting the TAP device will eventually be reduced to the point that it cannot be measured. An alternative way to envision this negative feedback loop is as a spiral, with the strength of the signal represented as a spiral, alongside a graph showing the output signal decreasing over time, as in Figure 8.2.
The spiral chart, on the left in Figure 8.2, shows the input and output of the circuit as a set of cycles, with each cycle moving closer to the null point in the middle of the spiral, or the point where the signal becomes unmeasurable. The curve version, on the right in Figure 8.2, shows the constantly reducing signal strength against a graph, with the same cycles marked out for reference.
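The cycle-by-cycle decay shown in Figures 8.1 and 8.2 reduces to a few lines of arithmetic. The tap fraction and cycle count below are arbitrary illustration values; the point is that each pass multiplies the level by a factor slightly less than one, so the spiral never quite reaches zero but becomes unmeasurably small.

```python
def run_feedback(initial, tap_fraction, invert, cycles):
    """Toy model of the ADD/TAP circuit: each cycle the TAP returns
    tap_fraction of the current signal to the ADD, inverted for a
    negative loop. Returns the signal level after each cycle."""
    level = initial
    history = [level]
    for _ in range(cycles):
        fed_back = tap_fraction * level * (-1 if invert else 1)
        level = level + fed_back  # the ADD combines the two inputs
        history.append(level)
    return history

decay = run_feedback(initial=1.0, tap_fraction=0.1, invert=True, cycles=50)
# each cycle multiplies the level by 0.9; after 50 cycles the signal has
# spiraled down to roughly 0.005, effectively unmeasurable
```

Flipping `invert` to `False` turns the same arithmetic into the positive feedback loop described next.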
To illustrate a positive feedback loop, reverse the action of the TAP device shown in Figure 8.1; Figure 8.3 illustrates the result.
In this case, the TAP device is configured to leave the signal in phase with the original and reduce its strength (amplitude in more precise terms) a bit. What happens in this illustration is this:
1. The section of the wave marked A passes through the ADD device.
2. At first, there is no signal on the second input, so the wave passes through the ADD device without change.
3. The wave continues on and passes into the TAP device.
4. At the TAP device, a small replica of the wave is made, and this little replica of the signal is passed back along the path to the second input of the ADD device.
5. This signal, being fed back from the TAP device, is added to the original signal by the ADD device.
6. Because this signal is not inverted, the addition of the two waves has the effect of increasing the strength of the output of the ADD device.
7. This process is much like taking a large number, multiplying it by some very small number (say 0.01), and adding it back to the original number.
8. This has the effect of causing the output level of the ADD device to increase.
9. This increased signal is then fed back into the TAP device, which takes the same proportion of the received signal and feeds it back to the ADD device.
Once again, it is sometimes useful to illustrate the same concept in a different way, as shown in Figure 8.4.
Looking at these two illustrations (Figures 8.3 and 8.4), it should be obvious why positive feedback loops are dangerous. First, unless there is something that limits the amplitude of the resulting signal, there is no end to the loop. In fact, most positive feedback loops do occur in systems with a limiter of some type; most of the time the limiter is some physical property of the system itself. As an example, consider feedback in an audio system—that loud screeching noise you hear sometimes when a microphone is placed too close to a speaker. Feedback of this type is caused by any noise in the amplification and speaker system (and there will always be some) being picked up by a microphone (TAP in the diagrams above), then fed back into the amplification system (ADD in the illustrations above), to create a higher volume noise, which is then picked up by the microphone as a louder noise, and hence amplified to a higher level, and then played by the speakers. The limiting factor in the audio example is the upper limit on the speaker’s volume, combined with the physical properties of the microphone.
Second, once the system reaches the limiter, it will stay there permanently. Any reduction in the rate or strength of change is immediately overcome by the action of the TAP/ADD feedback, which adds in whatever level of signal is needed (over time) to bring the system back to the limiting point. This is called saturation in the field of control systems and electronics.
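The audio-feedback example maps onto the same kind of toy arithmetic, with one addition: a limiter. The noise floor, gain, and limit values below are arbitrary assumptions; the point is the shape of the curve: exponential growth until the output pins at the limiter and saturates there.

```python
def positive_loop(initial, tap_fraction, limit, cycles):
    """Positive feedback with a saturation limiter: each cycle the output
    grows by tap_fraction of itself until it hits the physical limit
    (the speaker's maximum volume in the audio example)."""
    level = initial
    history = [level]
    for _ in range(cycles):
        level = min(level * (1 + tap_fraction), limit)
        history.append(level)
    return history

curve = positive_loop(initial=0.001, tap_fraction=0.5, limit=10.0, cycles=40)
# a barely measurable noise grows by 50% per cycle until it saturates at
# the limit of 10.0, where it stays pinned
```

Once saturated, the level never drops on its own, which is the behavior described above: the feedback replaces any reduction faster than it can take hold.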
Note
The illustrations here indicate that a positive feedback loop will always result in an increasing level of output, as the input and some part of the output are added in the processing. However, many positive feedback loops simply achieve stability, and hence are called self-reinforcing, or stable, feedback loops. In this case, the loop doesn’t drive the output ever higher, but simply holds the output stable at an elevated level. The result is a constant amplification of the input to the output, rather than an ever-increasing output. To simplify the discussion, these cases are still treated as positive feedback loops in this text.
The difference between a positive feedback loop and a negative feedback loop is the sign of the number fed back into the ADD in these illustrations. If the ADD is fed a negative number, the feedback loop will be negative. If it’s fed a positive number, the feedback loop will be positive. The steepness of the feedback, the rate at which it impacts the output signal, is impacted by the percentage of the original signal that’s taken off at the TAP and fed back to the ADD. The higher the percentage, the greater the speed at which the output increases or decreases.
How do feedback loops—positive feedback loops in particular—relate to network engineering? Returning to the original model of network complexity described in Chapter 2, “Components of Complexity,” feedback loops interact with all three pieces of the complexity model: state, speed, and surface. Working through some examples is going to be the best way to understand how feedback loops relate to complexity; this section will consider a packet duplication loop, a redistribution loop, and a link flap control plane failure loop.
Packet loops are more common than might be initially supposed in any given network—but they don’t always form feedback loops, and hence aren’t always related to a network failure. Figure 8.5 illustrates a typical packet loop.
The left half of Figure 8.5 illustrates a loop in which packet replication does not take place—so there is no immediately obvious positive feedback loop. Packets transmitted by Host A are forwarded by Router B to Router C, then on to Router D, then back to Router B, where they are again forwarded to Router C. There is, however, a self-sustaining loop, with enough positive feedback to sustain the increased traffic levels across the links in the network. If Host A is transmitting enough traffic, this self-sustaining loop can still drive traffic up to the limiter, or saturation point, for the loop.
To understand why, consider that the only way to prevent traffic from passing along this loop eternally is to put a maximum hop count in the packet itself—a time to live. If the time to live is set very high, to allow for a wide diameter network (a network with a lot of hops), then each packet Host A transmits will increase the load on the network while older packets are still counting down their time to live. If the time to live is set to 16, Host A can consume up to 16 times the bandwidth of its link to Router B across the other links in the network. If the Host A → Router B link is 1g, then Host A can consume up to 16g across the [B,C], [C,D], and [D,B] links (combined, or one third of 16g on each of the three links).
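The 16x figure above is simple arithmetic, and worth making explicit. The link names and rates are the ones assumed in the example; the helper below is illustrative only.

```python
def loop_load(injected_gbps, ttl, loop_links):
    """Worst-case load a looping (non-replicating) stream places on the
    loop: each packet circulates until its TTL expires, so the loop as a
    whole carries up to ttl times the injected rate, spread across the
    links that form the loop."""
    total = injected_gbps * ttl
    return total, total / loop_links

total, per_link = loop_load(injected_gbps=1, ttl=16, loop_links=3)
# total == 16: up to 16g across the [B,C], [C,D], and [D,B] links
# combined, roughly 5.33g on each link
```

The same arithmetic shows why a "generous" TTL, chosen to accommodate a wide-diameter network, is a weak limiter: the amplification factor scales directly with it.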
The right side of Figure 8.5 illustrates a forwarding loop with positive feedback characteristics. Here a packet transmitted by Router E to Router F is forwarded by Router G to both Routers H and K, each of which replicates the packet and transmits it back to Router F. Each time the single packet traverses the loop, it doubles; in the first round, then, the single packet transmitted by Router E becomes two, in the second round it becomes four, in the third round it becomes eight, and so on. It should also be obvious this is only a problem in the case of multicast or broadcast traffic; unicast traffic wouldn’t normally be duplicated at Router F.
What would the limiter be for either of these forwarding loops? Limiters would include:
• The speed at which any device in the forwarding path can forward the traffic.
• The bandwidth of the links connecting the devices.
• The time to live of the packets being transmitted into the forwarding loop. This generally isn’t going to be much of a limiter, as any time to live is normally set high enough to ensure transport across a large diameter network, which is almost always too high to prevent collateral damage from a forwarding loop.
• The viability of the control plane state passing over the links through which the forwarding loop passes.
While most of these are fairly obvious, the last limiter might not be—what precisely does the viability of the control plane mean? If the traffic along the links involved in the loop becomes large enough to cause packet drops, the control plane will not be able to maintain state. When the control plane state fails, the reachability information that formed the link will (likely) be removed from the table, causing the loop to “unwind.” If the loop is stable, it will be re-formed when the control plane relearns reachability information across the links in the loop.
How do forwarding loops that lead to feedback loops form in a network? The problem generally begins someplace in the control plane: control planes form forwarding loops when the actual state of the network doesn’t match the control plane’s view of the network. Some examples might be helpful in understanding how and when this happens.
Figure 8.6 illustrates a network with mutual redistribution between two routing protocols.
In this network, Router A is redistributing 2001:db8:0:1::/64 into OSPF v3 toward Routers B and D with a cost of 100. To simplify the explanation, take one side of the loop; Router B redistributes this destination into EIGRP with a metric of 1000. 2001:db8:0:1::/64 is then picked up as an EIGRP external route by Router D, and redistributed back into OSPF with a cost of 10. Router D then advertises this route back to Router B along the broadcast link shared by Routers A, B, and D; the route reaches Router B as an external OSPF route with a cost of 10. Because the cost through Router D—10 plus the cost of the [B,D] link—is less than the cost of the route through Router A—100 plus the cost of the [B,D] link—Router B will choose the path through Router D to reach this destination. This type of routing loop can time itself out through the increasing costs each time the destination is redistributed between the two protocols, but it will quickly be rebuilt once the first cycle of redistribution is completed. This type of routing loop can also remain stable, depending on the way the redistribution metrics are chosen.
The reason this works is the original route is redistributed into OSPF, rather than simply being advertised into OSPF. 2001:db8:0:1::/64 must be an external route at both points of entry to create the loop; otherwise Routers B and D will have one internal OSPF route and one external OSPF route, and they will always prefer the internal route over the external, breaking the loop. Multiple points of redistribution are required to cause this type of loop.
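Router B's decision can be made concrete with a small sketch. The route types and costs mirror the scenario above, but this is an illustration of the selection rule, not an OSPF implementation, and the link cost of 5 is an assumption:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    via: str        # next-hop router
    external: bool  # OSPF external (redistributed) route?
    cost: int       # total cost, including the link to the next hop

def best(routes):
    # OSPF prefers internal routes over external; within a type,
    # the lowest cost wins (False sorts before True).
    return min(routes, key=lambda r: (r.external, r.cost))

LINK_COST = 5  # assumed cost of the shared [A,B,D] broadcast link

# Both entry points are external, so only cost decides -- the loop forms.
looped = best([
    Route(via="A", external=True, cost=100 + LINK_COST),
    Route(via="D", external=True, cost=10 + LINK_COST),
])

# If the prefix reached B as an internal route via A instead, route type
# would win regardless of cost, and the loop would be broken.
broken = best([
    Route(via="A", external=False, cost=100 + LINK_COST),
    Route(via="D", external=True, cost=10 + LINK_COST),
])

print(looped.via, broken.via)  # prints: D A
```

The two calls show the whole story: when both routes are external, the cheaper (re-redistributed) path through Router D wins and the loop forms; make one route internal and type preference breaks the loop.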
It’s tempting to say the problem here is that information passing from one routing domain is being leaked, or redistributed, into another routing domain, and then it’s being redistributed back again—that the mutual redistribution is actually the root of the problem. But while removing the mutual redistribution, or blocking redistributed routes from being redistributed again (through filters, tags, communities, or other mechanisms), will resolve the problem, mutual redistribution is not the root cause.
The root cause is actually the removal of information about the state of the network in the redistribution process itself. Reaching back to routing fundamentals, you might recall that routing protocols determine whether a particular path to a destination is a loop by examining the metrics. Because the metrics any two protocols use cannot be directly compared, the metric assigned to a redistributed route must simply be configured or calculated in some way—no matter how this configuration or calculation is done, information about the state of the network (the cost of the path to the destination in this case) will likely be lost when redistributing routes.
Hence, when routing information is redistributed between two different routing protocols, a mismatch between the actual state of the network and what the control plane believes to be the state of the network occurs. Any time such a mismatch occurs, there is the possibility of a routing loop, which then causes a (potentially permanent) forwarding loop in the network.
Chapter 7, “Protocol Complexity,” spent a good deal of time examining microloops and various solutions, specifically in terms of increasing protocol complexity in an attempt to reach the “corner” of the Turing Curve (see Chapter 1, “Defining Complexity”). A quick revisit here in the context of forwarding loops will be useful, using Figure 8.7.
Given this network is running a link state protocol, and Router D’s best path to 2001:db8:0:1::/64 is through Router C, a failure at [B,C] will cause a loop to form between Routers C and D during the time after Router C has converged and before Router D has converged. Chapter 7, “Protocol Complexity,” discussed a number of solutions for this problem—but what, really, is the root cause?
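A snapshot of the two forwarding tables during that convergence window makes the microloop visible. The next-hop entries below are hypothetical, standing in for Router C's post-convergence state and Router D's stale, pre-convergence state:

```python
# Hypothetical FIB snapshots during the convergence window: C has already
# rerouted around the failed [B,C] link and points back toward D, while
# D still points at C. Until D reconverges, packets ping-pong.
fib = {
    "C": {"2001:db8:0:1::/64": "D"},  # post-convergence next hop
    "D": {"2001:db8:0:1::/64": "C"},  # stale, pre-convergence next hop
}

def trace(start: str, prefix: str, max_hops: int = 6) -> list:
    """Follow next hops until the hop budget (a stand-in for TTL) runs out."""
    path, node = [start], start
    for _ in range(max_hops):
        node = fib[node][prefix]
        path.append(node)
    return path

print(trace("D", "2001:db8:0:1::/64"))  # D, C, D, C, ... a microloop
```

The loop is transient; as soon as Router D's entry is updated, the trace terminates normally. The damage done depends on how long the window lasts and how much traffic is caught in it.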
While the most obvious problem is the two routers should recalculate at the same time, the actual root cause is the mismatch between the network topology as it exists and the network topology as Router D believes it to be. This reinforces the point that any time there is a mismatch between the topology as it exists and the control plane’s view of the topology, bad things happen.
Seen from this angle, it becomes obvious why the problem is so difficult to solve. The CAP theorem states that you cannot design a database that is at once consistent, available, and tolerant of partitions (perhaps it should be the CAT theorem, named in honor of that inexplicable creature, the cat?)—you must choose two of the three. If the control plane is simply treated as a distributed real time database, the routing protocol must give up something—which one should it be?
Tolerance for partitioning is certainly not something a distributed routing protocol can give up; that would destroy the very essence of its distributed nature. Availability is another area that simply isn’t in doubt; a routing protocol database that isn’t accessible by every device participating in the control plane all the time really isn’t much use for packet forwarding duty. Consistency, then, is the point that must give way in a distributed control plane—and as Chapter 7’s exploration of the problem of microloops illustrates, there’s no simple way to resolve the problems that come from an inconsistent view of the network (in fact, according to the CAP theorem, there’s no way to actually resolve the problem entirely, no matter how much complexity is thrown at it).
Another set of choices can be made, of course—the control plane database can be centralized, removing the “P” from the CAP theorem, and hence making the “A” and “C” theoretically possible. It doesn’t always work out this way in real life, however, as you’ll discover in Chapter 10, “Programmable Network Complexity.”
With this understanding of feedback loops in the background, it’s time to move back into the world of complexity. This section will begin with another example of a control plane loop, and then continue with a discussion of the speed, state, and surface of complexity in relation to network engineering.
Spanning Tree is famous—perhaps infamous is a better term—for cascading control plane failures. Figure 8.8 illustrates a small network over which it is possible to trace such a failure.
This network begins with Switch A as the root bridge, and the path [B,D] blocked by the Spanning Tree Protocol. Assume that two things occur at the same time:
• An outsized unidirectional flow is passing through the switched network, using a large amount of the available bandwidth, from Switch A through Switches B and C, to Switch E (and someplace beyond Switch E).
• Some process on Switch B misbehaves, causing the Bridge Protocol Data Unit (BPDU) process to fail in sending regular hello packets.
If the spanning tree process fails to send regular BPDUs at Switch B, Switches C and D will begin to elect a new root bridge, and hence to determine the shortest set of paths through the network without reference to the current state of Switch A as the root bridge, or the existence of Switch B. This will result in the [B,D] link being unblocked for traffic, so:
• The large flow originating at A will be transmitted across the broadcast link [B,C,D].
• The large flow will be transmitted by both Switches C and D onto the broadcast link [C,D,E].
• Switch D will receive the same large flow at its [C,D,E] interface and retransmit it back onto the [B,C,D] broadcast link, setting up the forwarding loop.
Once the forwarding loop is started, the only limiting factor will be the saturation point for the links, interfaces, and forwarding devices in the network. As the traffic on the wire builds, the BPDUs that Spanning Tree counts on to form a shortest path tree along the network topology will be dropped, causing the network to further fragment into independent switches each electing their own root bridge. Once the network reaches this point there is no way to recover other than the failure of one (or more) of the switches, causing the traffic to stop being forwarded, and giving Spanning Tree the chance to re-establish loop free paths through the network.
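The runaway dynamic can be sketched with a toy model. The capacity and flow rate below are assumed numbers chosen for illustration; the point is the shape of the curve, not the values:

```python
LINK_CAPACITY = 1000  # Mb/s, assumed link capacity
flow = 100            # Mb/s of looped traffic, assumed starting rate

# Each trip around the unblocked loop re-injects the flow, doubling the
# load, until the link saturates.
cycles_until_bpdu_loss = 0
while flow < LINK_CAPACITY:
    flow *= 2
    cycles_until_bpdu_loss += 1

# Once the link saturates, Spanning Tree's own BPDUs are among the packets
# dropped -- the control plane's repair messages are starved by the very
# failure they are needed to fix.
print(cycles_until_bpdu_loss, flow)  # prints: 4 1600
```

With these numbers the network goes from a 10% utilized link to total saturation in four cycles, which is why these failures appear to happen "all at once" from an operator's perspective.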
There are several points to be considered in the examples given thus far; the state/speed/surface model will be used here to put these points into complexity terms.
State: In each of these cases, control plane state plays a role in starting or maintaining the feedback loop. The problem is a lack of information about the true state of the network topology. Any time there is a mismatch between what the control plane believes about the network topology, and the actual network topology as it exists in the real world, there will be some forwarding issue. The examples above used a packet loop to remove topology information through packet loss (spanning tree failure), and the removal of control plane state describing the actual network topology (redistribution) to illustrate this mismatch between state and reality.
Speed: The loss of information in the control plane, resulting in a mismatch between the network and the control plane’s view of the network, is the primary cause for each of the examples given here. However, it is important to note the role speed plays in each situation, as well. In the case of microloops, the control plane isn’t fast enough to keep up with changes in the network topology. Speeding the control plane up, however, simply causes other complexity issues to surface—there is no such thing as a free lunch in the world of complexity (or, rather, you can’t beat the complexity demon as easily as just setting a timer to make things go faster).
An alternative way to look at this is as a see-saw: the faster the network reacts to topology changes, the more likely you are to find situations where the control plane forms a positive feedback loop. Speed is good, but speed is also bad. It’s far too easy to push down one side of the see-saw without thinking through what’s going to happen when the weight shifts and your side starts heading toward the ground. What is it you’ll face when you “saw” rather than “see”? A nasty bump on the bottom, or something worse?
Surface: Two of the examples just discussed illustrate the problem with broad, deep surfaces connecting two different systems. In the redistribution example, the surface between the two routing protocols can be said not to be deep enough. Because information isn’t fully exchanged between the protocols, the interaction surface is inefficient. Further, the interaction surface in this case is too broad; redistribution doesn’t need to take place at more than a minimal number of places in the network. Reducing the points at which redistribution does take place can potentially reduce the efficiency of traffic flow through the network, but this is a common tradeoff against control plane state.
In the case of the spanning tree failure, the interaction surface in view is between the control plane and the traffic flowing through the network, or the data plane. This isn’t normally an interaction surface network engineers think about, but because the control plane rides on the same links as the data plane (the control plane is in band), there is a definite interaction surface that needs to be considered.
The third example, microloops, illustrates the surface problem in another way. The control plane can be said to be one system, while the network topology can be said to be a separate system. The detection of the topology state by the control plane is therefore an interaction surface. In this case, the interaction surface can be said to be too shallow, in that the control plane cannot always react to changes in the topology as quickly as those changes actually occur. This is another way of looking at the problem described as a state and speed issue in the paragraphs above.
Feedback loops between systems or within a system are interesting, but they are often easy to spot once you know what you’re looking for. Shared fate, however, is often very difficult to see in any given network—most often because shared fate situations are intentionally buried under layers of abstraction designed to reduce the apparent complexity of the network. What is a shared fate problem, and how does it impact network failure in the real world? The best way to understand this type of problem is through examples. Once these examples have been considered, shared fate problems will be related back into the complexity model used throughout this book—speed, state, and surface.
Virtual circuits aren’t new—in fact, the use of tagged packet headers to put several different “circuits” onto a single physical wire has been common almost from the very beginning of networking. Starting with multiplexed T1s, working through Frame Relay, and on to the more modern 802.1Q and 802.1ad headers placed on an Ethernet frame to break a single Ethernet link into multiple virtual topologies, virtualization has been the rule, rather than the exception, in data link protocols. Today, engineers can choose from VXLAN, MPLS, and many other technologies to virtualize their links.
Virtualization provides many benefits, such as:
• The capability to hide one virtual topology from another; a form of information hiding that reduces the state carried in the overlay control plane, and the speed at which the control plane must react to changes in the topology.
• The capability to abstract the physical topology into multiple logical topologies, each with their own characteristics, such as per hop behaviors.
• The capability to carry traffic across a longer than shortest path (to increase the stretch of a specific path) to meet specific business or operational goals.
There is a downside to virtualization, however—the SRLG. Figure 8.9 illustrates shared risk link groups.
Assume the following:
• Routers A and B are both demarcation (or handoff) points for Provider X, which is selling a high-speed link between two cities.
• Routers C and D are both demarcation (or handoff) points for Provider Y, which is selling a high-speed link between two cities.
A customer purchases a virtual circuit from Provider X, and a backup link from Provider Y, to connect their facilities in the two cities that both providers interconnect. A backhoe operator then proceeds to fade the network at Link E (perhaps they are cleaning out a drainage ditch someplace alongside a road).
This will cause both of the virtual circuits to fail, as they share a single physical link.
How can the customer avoid this situation? Neither provider is likely to explain exactly how their virtual circuit is provisioned, hop-by-hop—this would be giving out competitive information that could negatively impact the provider’s business. From the customer’s perspective, the virtualization of the single link, E, has created an SRLG that is not visible because of the magic of abstraction—and there’s little the customer can do about the situation other than using a single provider and demanding that this type of situation not arise in the network.
The virtualization abstraction can be said to leak; state at lower logical layers leaks into higher logical layers in the form of shared risk groups, where a single failure translates into a number of different outages.
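If the circuit-to-physical-link mapping were visible, detecting the SRLG would be a simple inventory exercise, which is exactly why the providers' opacity matters. The sketch below assumes a hypothetical mapping in which both circuits happen to ride Link E:

```python
from collections import defaultdict

# Hypothetical expansion of each provider's virtual circuit onto the
# physical links it actually rides; "link-E" is shared, as in Figure 8.9.
circuit_paths = {
    "provider_X_circuit": ["X-metro-a", "link-E", "X-metro-b"],
    "provider_Y_backup":  ["Y-metro-a", "link-E", "Y-metro-b"],
}

def shared_risk_links(paths):
    """Return the physical links carrying more than one circuit."""
    users = defaultdict(set)
    for circuit, links in paths.items():
        for link in links:
            users[link].add(circuit)
    return {link: sorted(cs) for link, cs in users.items() if len(cs) > 1}

print(shared_risk_links(circuit_paths))
# prints: {'link-E': ['provider_X_circuit', 'provider_Y_backup']}
```

The computation is trivial; the hard part in practice is obtaining the `circuit_paths` data at all, since it sits below the abstraction boundary the providers present to the customer.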
Shared fate problems don’t just ride atop virtualized links, however; any time virtualization is used, there is the possibility of a shared fate situation developing. The behavior of multiple TCP flows passing through a single set of buffers or queues in a network is an unexpected variant of a shared fate problem. Figure 8.10 illustrates the behavior of several TCP flows passing through a network.
In this diagram, three hosts, A, B, and C, are sending three different TCP streams to three other hosts that are only reachable across the single [D,E] link. Examining the output queue at D, there are six packets waiting to be transmitted; two from each of the three streams. If this output queue can only hold three packets, then three of the packets placed in the queue will be dropped. If a tail drop mechanism is used, the three newest packets in the queue will be dropped, which means P4, P5, and P6.
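The tail-drop behavior can be sketched as follows. The queue depth matches the example; the window sizes are assumed, and a simple window halving stands in for TCP's real slow-start response:

```python
QUEUE_DEPTH = 3  # packets the output queue at Router D can hold

def tail_drop_round(windows):
    """One round: each flow enqueues two packets, interleaved as in the
    figure's queue [P1..P6]. Tail drop discards the newest arrivals, so
    every flow loses a packet and backs off at the same time."""
    arrivals = [f for _ in range(2) for f in windows]  # A, B, C, A, B, C
    dropped = set(arrivals[QUEUE_DEPTH:])              # {'A', 'B', 'C'}
    return {f: max(1, w // 2) if f in dropped else w + 1
            for f, w in windows.items()}

windows = {"A": 8, "B": 8, "C": 8}  # assumed congestion windows, in packets
for _ in range(2):
    windows = tail_drop_round(windows)
print(windows)  # prints: {'A': 2, 'B': 2, 'C': 2}
```

Because the drop hits one packet from every flow, all three senders halve their windows in lockstep; plotting the aggregate over repeated rounds produces exactly the sawtooth described below.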
Note that these three packets represent one packet from each stream, so each of the three TCP sessions will go into slow start mode at the same time. Whether or not the three TCP sessions are using the same timers to rebuild to a larger window, the same situation will arise when the [D,E] link reaches some percentage of its maximum capacity, causing the cycle to restart. The result is the sawtooth utilization chart on the bottom right of the figure—clearly a suboptimal use of the link capacity. There are several ways to look at this problem.
First, this can be modeled as a case of a leaky abstraction. Each TCP session sees the path from host to host as an exclusive channel, but it’s not—the channel is being shared with other TCP sessions. The single link, then, is abstracted into three links, each appearing to be an exclusive link. The underlying reality leaks through in the single output queue at Router D, forcing the upper layers to interact with the underlying reality.
Second, this problem can be seen as a case of interacting surfaces—the “other side” of the leaky abstraction explanation. Within the model of complexity theory used throughout this book, speed, state, and surface are the key points. Here the problem is caused by the interaction between the underlying transport system (the physical and data link on the [D,E] link) and the overlay transport system (IP and TCP). At the intersection of these two systems there is a surface along which they interact—specifically the output queue at Router D. Within the model of network complexity used here, then, this problem can be solved by either removing the interaction surface (making the three links three actual links, rather than one shared link), or by adding more depth and complexity to resolve the nuances of the interaction. Of course there will be tradeoffs in adding this depth to the interaction surface.
Weighted Random Early Detection (WRED) is one mechanism used to resolve this problem. By randomly dropping packets off the queue, rather than always dropping the last set of packets added to the queue, the impact of TCP going into slow start is spread across time on different sessions, preventing the original synchronization of the various streams. However, WRED can have side effects on a single stream running across such a queue, and it’s often difficult to precisely tune WRED to manage both tiny (mouse) flows that only last a few seconds alongside larger, long lived (elephant) flows along the same link. It’s often useful to move mouse and elephant flows onto different links in a network just to mitigate the effects of TCP synchronization and quality of service buffer issues. So in an attempt to resolve one problem in a specific interaction surface, complexity ripples outward like the little wavelets expanding from a stone thrown in the water. Okay, that was very philosophical, but you get the point.
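The RED-style drop curve underlying WRED is simple to state: no drops below a minimum average queue depth, a linearly increasing drop probability between the minimum and maximum thresholds, and certain drop above the maximum. The thresholds and maximum drop probability below are illustrative values, not vendor defaults:

```python
def wred_drop_probability(avg_queue: float,
                          min_th: float = 20.0,   # assumed min threshold
                          max_th: float = 40.0,   # assumed max threshold
                          max_p: float = 0.1) -> float:
    """Classic RED drop curve: 0 below min_th, a linear ramp up to max_p
    between the thresholds, and 1.0 (drop everything) above max_th."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)

print(wred_drop_probability(10))  # 0.0  (queue shallow, no drops)
print(wred_drop_probability(30))  # 0.05 (halfway up the ramp)
print(wred_drop_probability(45))  # 1.0  (queue full, tail-drop regime)
```

Because drops are random and begin early, different flows lose packets at different times, desynchronizing their backoff instead of punishing all of them at once as tail drop does. Tuning `min_th`, `max_th`, and `max_p` per traffic class is where the practical difficulty described below comes in.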
Finally, to return to the theme of this section, this problem can be seen as a shared fate issue. In this case, you can model the three TCP streams as virtual circuits running across a common physical infrastructure. The output queue at Router D is a shared resource through which all three streams must pass, and the fate of that queue is shared by all three flows.
This specific example is useful because it’s relatively easy to see all three perspectives—leaky abstractions, interaction surfaces, and shared fate—as different ways to model the same underlying problem. Each of these three models can suggest a different set of solutions, or even point you to problems with any given solution.
This chapter has covered two broad areas of the interaction between complexity and failure: feedback loops and shared fate. Both of these situations can be mapped to the complexity model first described in Chapter 1, “Defining Complexity,” and used throughout the book:
• State: The mismatch between the actual state of the network and the state as viewed by the control plane is the ultimate cause of microloops formed through the control plane, while the removal of state about the actual topology through a variety of means is often the ultimate cause of more permanent forwarding loops formed in the control plane. In the case of shared fate, the state of a single shared resource leaks through the abstraction created by upper layers in the protocol stack to impact the operation of the upper layer protocol.
• Speed: The speed of change in the actual state of the network can overwhelm the control plane, or happen quickly enough that the control plane cannot react in real time. In reality, all control planes are near real time, rather than real time, to provide enough “buffer” to prevent a control plane failure in the case of rapid changes.
• Surface: The interaction surface between the topology and the control plane plays a large role in positive feedback loops and in many shared fate problems, such as TCP synchronization.
This chapter has discussed root causes for failures, but this is a concept engineers need to be cautious with. Complex systems, particularly highly redundant and available ones, are always in a pseudo-failure mode. There is always something wrong someplace in any truly complex system. The techniques used to manage and control the impacts of failures are generally sufficient to prevent a total system failure, so the overall system is not impacted. As a result (or corollary) of this, when a major, or systemic, failure occurs, it usually has more than one cause. A common situation is the combination of some set of pseudo-failures with a single shift in the topology, control plane, or forwarding plane that acts as a catalyst, resulting in a systemic failure.
This implies that the search for a single root cause is generally counterproductive in managing or doing postmortems of systemic failures in large complex systems. What postmortems should do, instead, is attempt to find the original conditions, the catalyst, and the feedback loops and interaction surfaces that led from the one change to the final failure. In a complex system, a failure isn’t a single thing—it’s a complex system in and of itself.
There is a human side to consider, as well. A well-thought-out expression of this human side is given by Richard Cook:
The evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes. . . . This means that ex post facto accident analysis of human performance is inaccurate.1
1. Richard I. Cook, How Complex Systems Fail (Cognitive Technologies Laboratory, 2000).
Instead of just seeking out a root cause and then fixing the blame, organizations that build and manage large complex systems need to focus on finding and fixing the problem—including reaching back into the business drivers that drove the creation of complexity in the first place. One possible solution to constant failure through complexity is to remember that complexity arises as a side effect of solving hard problems. Sometimes the only real solution, the only way to truly reduce complexity, is to reduce the hardness of the problems.
For the individual engineer, what skills are best for managing failure in a large complex system? There are three answers to this question:
• The ability to see where and how feedback loops can form in any particular design or deployment. Positive feedback loops are probably the most destructive force in the world of control plane failures; in many years as a network engineer, I can almost always point to a positive feedback loop as part of the root cause—or the primary root cause—in virtually every control plane failure.
• The ability to “reach through” abstractions, and to “see” the interaction surfaces, as they exist in the real world. Being able to model these interaction surfaces in a number of different ways can help the engineer to really understand the problem at hand, and to find a good solution within the realm of complexity tradeoffs—or to simply ignore the problem, on determining that the tradeoffs involved will overwhelm any possible solution.
• Gaining experience with failure. Many skills, like riding a bicycle, must be learned through physical experience. Managing failure on large complex systems is one of those skills—and like riding a bike, it’s also a skill not easily forgotten once well learned. There is no simple way to develop these sorts of skills in the real world other than putting new engineers on smaller failures, teaching them the thinking and engineering sense needed to develop failure management skills, and then letting them practice. This is a hard lesson for those who depend on large complex systems to learn, but it’s a necessary one.
Why not just solve these problems? Because complexity doesn’t allow for a “silver bullet.” The Turing Curve always catches up with any proposed solution, no matter how well thought out. Unintended consequences and leaky abstractions will always find a way to sneak through even the most tightly woven net.
In a sense, networks have been programmable from the beginning. BGP communities, originally outlined on two napkins at a Washington, DC bar by Tony Li and Yakov Rekhter, enshrined carrying complex policy within a routed control plane. While the tagging capabilities in many routing protocols, including OSPF, IS-IS, and EIGRP, were useful for simple tasks, the ability to attach a full policy set to a single prefix added an entire range of capabilities. What else can policy-based routing and traffic engineering be considered other than programming the network? What’s different with the modern drive to make networks programmable (at press time—networking technology changes almost as quickly as the width of men’s ties and the length of hemlines)? How are DevOps and software-defined networks (SDNs) different, and what’s driving this movement toward a programmable network? There are several possible answers to this question:
• The specific mechanism used to interact with the control and forwarding systems.
• The rate of business change.
• The changing perceived business value of technology.
• The perceived complexity of distributed control planes.
This chapter provides the groundwork for Chapter 10, “Programmable Network Complexity.” It begins with the drivers and a definition, as these underlie the tradeoffs network engineers make when considering how and where to deploy network programmability. The second section considers use cases for network programmability to put a little flesh on the bones of the business drivers and definition. The third section then considers the SDN landscape by examining several proposed interfaces.
A number of years ago, an international service provider suffered a large-scale network outage. In response, the provider’s management demanded that 16 well-known network design and escalation engineers—distinguished engineers and engineers from the global escalation team—gather in a single office and work “as long as it took,” to redesign their network so it would never fail again. As one of the engineers involved remarked: “It’s going to take a long time to come up with such a perfect design. What we’re going to have is one person writing on the white board, and fifteen erasing.” This story transfers directly to defining the programmable network—if you put sixteen engineers in a room and ask them to define “the programmable network,” you’ll end up with one writing and fifteen erasing.
Because you can’t erase the pages of this book, however, this section will attempt to answer the question, what is a programmable network?—though you might be tempted to use correction tape on your computer screen before this section is done. The easiest path to a definition is through the drivers for programmable networks. Two subsets of drivers will be discussed here:
• Business drivers
• The ebb and flow of centralization
Every business is an information business.
It doesn’t matter if most businesses believe they’re not in the information business—they are. In fact, business has always been about information, though the focus of the information has shifted over the years. First, there was information about technique (craftsmanship), then about who could be trusted where for trade (mercantile systems), then there was a return to technique (manufacturing systems), and so on. In today’s world, the information that rises to the top of everyone’s mind is about the customer—who are they, what do they want, and how do I “talk” to them in a way that will get their attention?
A key difference between the old information economy and the new is the speed at which the things business leaders want to know is changing. Manufacturing systems, techniques, formulas, processes, and trading partners change more slowly than the latest fad; customer desires change direction like the wind in a storm. As the information driving the business increases in speed, it’s only natural for business leaders to look for ways to manage that information. As Jill Dyche points out: “With the advent of trends like big data, businesspeople are more apt to see connection through the lens of information, consolidating key data in an effort to reach a single view of the business.”1 Handling and producing information on a fairly static set of systems (including the network) is very difficult.
1. Jill Dyche, The New IT: How Technology Leaders Are Enabling Business Strategy in the Digital Age (McGraw-Hill, 2015), n.p.
Virtual topologies can be built quickly and efficiently on top of a physical underlay, responding to new applications and requirements much faster. While the underlay can remain fairly constant in design and scope (generally achieved by using a scale-out paradigm that allows new capacity to be added fairly quickly), new virtual topologies can be spun up and managed for different applications or business groups to meet short- or long-term needs as they arise.
Spinning up such overlays on a regular basis is difficult with traditional distributed control planes. Programmable networks allow overlays to be built and managed more quickly, providing the ability to manage information at the speed of business.
Driving the top line up—increasing revenues—is only one side of the equation for programmable networks, however. Manually provisioning networks requires a lot of time, effort, and expertise—and can often lead to a lot of mistakes just through human error. Repeating the same configuration across a large number of devices manually will inevitably result in some mistake being made someplace. Large-scale repetition of common tasks is, however, ripe for automation, saving time, and making the network more available by reducing the MTBM. Automating these tasks can drive the bottom line down by reducing the OPEX required to run the network.
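Much of this repetitive provisioning is straightforward to automate with even simple tooling. As a hedged sketch (the device names, addresses, and configuration syntax are all invented for illustration, not taken from any real network), the Python below renders the same interface stanza for a list of devices from one template, so a typo is made once in the template rather than once per device:

```python
# Illustrative only: device names, IP plan, and CLI syntax are hypothetical.
TEMPLATE = """interface {iface}
 description uplink to {peer}
 ip address {addr} 255.255.255.252
 no shutdown"""

def render_configs(links):
    """links: list of (device, iface, peer, addr) tuples.
    Returns a dict mapping each device name to its generated config text."""
    configs = {}
    for device, iface, peer, addr in links:
        configs.setdefault(device, []).append(
            TEMPLATE.format(iface=iface, peer=peer, addr=addr))
    return {dev: "\n".join(stanzas) for dev, stanzas in configs.items()}

links = [
    ("edge1", "Ethernet0/0", "core1", "10.0.0.1"),
    ("edge2", "Ethernet0/0", "core1", "10.0.0.5"),
]
configs = render_configs(links)
```

The generated text would still need to be pushed to each device, which is exactly where a machine-readable interface (rather than screen scraping a CLI) earns its keep.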
Another way programmable networks can drive the bottom line down is by increasing the overall utilization of the network. For instance, traffic often follows daily, hourly, weekly, and seasonal patterns. Because of the difficulty of rapidly moving traffic around to lower utilized links, network operators often overbuild their networks. Bandwidth is chosen to support the highest traffic across a single link. During non-peak times, this bandwidth is not used, and is therefore a cost that has no return on investment. If the network is programmable, however, traffic can be engineered in near real time to adjust to increasing load on the network—or the network can signal some applications (such as a backup job) to wait until another time to send any traffic at all. Planning bandwidth usage in advance in this way is called bandwidth calendaring. While calendaring cannot replace the need for effective bandwidth planning, it can move the emphasis away from building for peak load and toward building for a number closer to the average load across longer time periods.
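As a toy illustration of the idea, the sketch below finds the contiguous window with the most spare capacity in an invented 24-hour utilization profile. A real calendaring system would work from measured, per-link data and far richer constraints, but the core decision is the same:

```python
# Toy bandwidth-calendaring decision; the utilization profile is invented.
def best_window(hourly_util, hours_needed):
    """Return the start hour of the contiguous window with the lowest
    average utilization (wraparound past midnight ignored for simplicity)."""
    best_start, best_avg = 0, float("inf")
    for start in range(len(hourly_util) - hours_needed + 1):
        avg = sum(hourly_util[start:start + hours_needed]) / hours_needed
        if avg < best_avg:
            best_start, best_avg = start, avg
    return best_start

# Fraction of link capacity in use for each hour of the day (hypothetical).
util = [0.2, 0.15, 0.1, 0.1, 0.3, 0.6, 0.8, 0.9, 0.85, 0.7, 0.6, 0.5,
        0.55, 0.6, 0.7, 0.8, 0.85, 0.9, 0.8, 0.6, 0.5, 0.4, 0.3, 0.25]
start = best_window(util, 3)   # schedule a deferrable 3-hour backup job
```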
Finally, many network operators see programmable networks as a way to separate their hardware investment from their software, or systems, investment. In traditional networking, the control plane—and therefore the basic architecture of the network—is intimately tied to the hardware. Purchasing a particular vendor’s equipment ties the operator to a particular feature set. Vendors have often made a fine art out of being “just standard enough” to claim interoperability, while including features that produce the highest degree of lock in possible. From the vendor’s perspective, these “lock in features” are actually examples of innovation and added value.
Programmable networks are seen, by many operators, as a way out of this back and forth with vendors. A programmable network might often be able to provide the same level of service as vendor-specific “lock in features,” while avoiding the actual vendor lock in. This could, in turn, drive down Capital Expenditures (CAPEX), as vendors are forced to compete on hardware and software features separately.
Part of the rush to programmability is, of course, part of the normal ebb and flow of centralization versus decentralization in the world of information technology. The first desktop computers were stuffed into closets, hidden from the IT department. Data was first pulled off mainframes into Lotus 123 and other software packages through screen scrapers on desktop computers connected to the mainframe through terminal emulation cards. As data was distributed through the organization, IT tried to regain control through minis, Structured Query Language (SQL), and middleware. At press time, centralization is the rage again, this time with cloud computing. What drives these cycles? Table 9.1 discusses a few.
The ebb and flow of centralization and decentralization is obvious enough for general computing, but what about network control planes? Looking back across the history of networks, the same sort of cycle can be observed. Telephone networks were the “original” networks, operated by a small army of people who understood how to wire cables for manual cross connects, and to manage central offices with rotary relays that would make connections based on sequential circuit interruptions. These centralized management systems were automated over time, resulting in large-scale Private Branch Exchange systems.
The public telephone network eventually fell into competition with packet switched networks, adapting principles of the distributed control planes used in IP and other packet switched networks in Signaling System 7. Distributed networks eventually took over the role of the telephone network, with most voice carried on top of IP, facilitated by distributed control planes. The drive toward programmable networks can also be seen as a drive toward centralized control once again.
Table 9.1 considered some of the reasons for the ebb and flow of information between central repositories and decentralized compute and storage devices. What drives the ebb and flow of centralization and decentralization in the control plane? Perhaps the strongest driver is perceived complexity.
When distributed systems are the norm, network engineers develop a close and personal relationship with the complexity inherent in those distributed systems. Because there is little experience abroad with centralized systems, theories around how to build a simpler system through centralization grow like mushrooms, and there’s a move toward centralization to simplify networks. In periods where centralization is the rule, and decentralization the exception, the opposite occurs—decentralization appears much simpler than the centralized mess currently in use, and so the market moves toward the perceived simplicity of a decentralized system.
Complexity isn’t always just perceived, of course—to build the large networks required to support the centralization of data and processing, some form of network programmability is required. The sheer scale of such large networks requires automated processes to configure and manage thousands of physical devices and their connectivity, along with the virtual machines and virtual network overlays required to support a large number of customers or applications running on a single infrastructure.
These, then, are the drivers for network programmability:
• Increasing the ability of the network to adapt to business requirements, which are, in turn, changing more quickly in a consumer-driven world.
• Decreasing the OPEX required to manage large-scale networks by automating many processes that would otherwise require talent and time.
• Decreasing the CAPEX required to purchase and build large-scale networks by separating software from hardware. This allows the architecture of the network to be separated from vendor-driven architectures, or rather the appliance model of hardware explicitly tied to a particular piece of sheet metal. This has two effects; the first is simply to put hardware in competition with hardware and software in competition with software, reducing what is normally seen as a system into separate pieces that compete independently. The second is to allow the hardware and software to be managed as separate systems, each with their own lifecycles. Replacing hardware, in this model, doesn’t necessitate a new design or architectural conception of the way the network operates.
• The ebb of data from distributed computers and into more centralized compute and storage resources.
• The belief that the grass is greener on the other side of the centralized/decentralized fence.
• The belief that if I “invent it here,” it must be better often plays into the equation in some way, as well. The desire to be a “special snowflake,” or to build a unicorn, often drives engineers to change things around “just to be different,” in some number of cases.
Based on the drivers reviewed up to this point, how can network programmability be defined? While there is no perfect definition, a good working definition might be:
A network is programmable when the control and data planes provide an interface that allows the state of the network to be modified and monitored through a machine-readable, data-driven API.
The scope of this definition is wide, encompassing almost any mechanism that allows the monitoring and modification of state across an array of network devices. Several points should be noted:
• Command-line interfaces (CLIs), Graphical User Interfaces, and other interfaces designed for interaction with humans would not be included in this definition.
• Standardization is not included; while standardization is important, a single vendor network with a single set of interfaces would still qualify. This type of solution might not be ideal (or even acceptable) from the network design and management perspective. Vendor neutrality might be achieved through programmable networks, but it’s not a certain outcome. On the other hand, vendor neutrality is going to be almost impossible to achieve without programmable interfaces on networking devices.
• The state is both the control and the data plane. An entire taxonomy can be built around what the programmable API changes (for such a taxonomy, see Chapter 17, “Software Defined Networks,” in The Art of Network Architecture, Cisco Press, 2013).
• There needs to be some differentiation between managing a network and programming a network. The speed at which information flows into and out of the control and data planes isn’t a reliable measure of which category any particular action belongs in. The definition used here is that management relates to devices, while network programmability relates to the control or data plane. This isn’t a perfect split, but it’s useful for the discussions that follow.
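To make the definition a little more concrete, the sketch below models a device exposing a machine-readable, data-driven interface: state is read and written as structured data addressed by path, rather than scraped from screens designed for humans. The device model, the paths, and the two-verb API are all hypothetical, standing in for what a real protocol such as NETCONF or RESTCONF provides against a published data model:

```python
import json

class FakeDevice:
    """Stands in for a network element with a data-driven API (hypothetical)."""
    def __init__(self):
        self._state = {"interfaces": {"eth0": {"enabled": True, "mtu": 1500}}}

    def get(self, path):
        """Return the state subtree at path as JSON (machine readable,
        not a CLI screen meant for human eyes)."""
        node = self._state
        for part in path.strip("/").split("/"):
            node = node[part]
        return json.dumps(node)

    def patch(self, path, data):
        """Merge data into the state subtree at path."""
        parts = path.strip("/").split("/")
        node = self._state
        for part in parts[:-1]:
            node = node[part]
        node[parts[-1]].update(data)

dev = FakeDevice()
dev.patch("/interfaces/eth0", {"mtu": 9000})
state = json.loads(dev.get("/interfaces/eth0"))
```

The point of the sketch is the shape of the interaction, not the implementation: a controller manipulating dozens of devices this way never parses human-formatted output.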
Note
There is bound to be some question about the inclusion of screen scraping mechanisms off of a CLI in the definitions here. Two points should be noted about CLIs. First, the complexity of a screen scraping application is much greater than the complexity of a true machine interface. Screen scrapers must keep up with every possible change in the output and input formats of each device; dealing with the raw data models of each device—even if they are all different, and can potentially all change on a regular basis—removes one step from the processing. It can be argued that moving from screen scraping to a true programmable interface removes the most complex and difficult-to-maintain pieces of the process. Second, screen scraping is an imperfect adaptation that works for small-scale problems, but doesn’t really provide a holistic view of the network, nor is it scalable in the same way. Screen scraping is not included here as a programmable interface for these reasons—it will be mentioned here and there, but it’s not considered a forward-looking, well-thought-out solution to the network programmability problem.
One good way to understand a technology is to think through the problems it’s designed to solve. Examining use cases not only helps engineers to understand how the technology works, but also to understand why it works the way it does. Two use cases are presented here: bandwidth calendaring and software-defined perimeter.
Each network has a normal “pattern of usage” across time. There are times when the network is oversubscribed, and dropping packets, and other times when the network is undersubscribed, with bandwidth available that’s not being used. Figure 9.1 illustrates somewhat typical bandwidth utilization for a corporate network.
What should be immediately obvious are the times during which the network (or link, or fabric, or some other point at which this utilization is being measured) is very low. It’s intuitive to try and put applications that use a lot of bandwidth into these lower utilization periods. The solutions in this space range from the simple to the complex. For instance:
• How much of the network is utilized? Measuring the utilization of a single link is different than measuring the utilization of a set of links across the network, especially when any application could use any particular path, and the path might actually change over time (given equal cost load sharing, topology changes, etc.).
• The time at which the application kicks off large-scale transfers must be manually managed. If network utilization changes, then the times must be changed, as well. While this manual process can be automated through screen scraping and other means, no such process is going to be ideal.
• If the gaps in the network’s utilization are too short in duration, the application might not be able to finish its work within the available gap in utilization. In this case, should the application somehow be paused—and how should that “pausing” take place? What is the impact of such a pause on data consistency and application usage? The obvious solution is some sort of interface into the network, including a feedback loop that allows the application to see the impact of changes in the data flow on network conditions in near real time.
• There may be situations where the application’s traffic can be placed on a specific set of links, allowing the remainder of the network to be unaffected by that traffic. This type of traffic engineering solution is tangential to bandwidth calendaring—the two solutions can interact and overlap in a single network—and to the problem of elephant and mouse flows across a single fabric.
These challenges can be met—with a lot of fingers on keyboards across many people-years of time—through manual configuration. A definite improvement can be made in the process through more advanced screen scraping types of automation, given that the user interfaces of each device don’t change too often. To truly make the interaction between the network and application work well, however, the network needs a programmable interface through which the application and the network can communicate. Figure 9.2 illustrates the use case.
The process shown in Figure 9.2 is as follows:
1. The application sends a notification to the controller about an upcoming flow. This could include information about the size and rate of the flow, the time by which the flow must be completed, the ability of the flow to be paused, and whether the flow is near real time (streaming information) or fairly static data (block storage).
2. The forwarding devices in the network provide information on the current utilization, queue sizes, and other network state information.
3. The controller interacts with a data store that contains historical information about the state of the network, including bandwidth utilization, queue depths, jitter and delay across each link, and so on.
4. The controller determines when the application should start to finish within the time required with minimal impact on the other applications running on the network.
5. The controller signals the application when it should kick the flow off.
6. If necessary, the controller will configure Quality of Service (QoS), reserve links, move traffic off of links, and so on, to make the application’s transfer successful.
7. If network conditions change on the fly, the controller can recalculate the time required to carry the application’s flow across the network, signaling the application to pause the flow as needed.
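The scheduling decision in step 4 can be sketched in a few lines. One possible, purely illustrative policy is to defer the flow as long as possible while still meeting its deadline, leaving the controller room to re-plan if conditions change (step 7). The capacity forecast and the request parameters here are invented, not part of any real controller API:

```python
# Hypothetical controller scheduling policy for bandwidth calendaring.
def latest_start(size_gb, deadline_hr, spare_gb_per_hr):
    """Return the latest start hour such that forecast spare capacity from
    that hour to the deadline can carry size_gb; None if no start works
    (in which case the controller asks the application to renegotiate)."""
    for start in range(deadline_hr - 1, -1, -1):
        if sum(spare_gb_per_hr[start:deadline_hr]) >= size_gb:
            return start
    return None

# Forecast spare capacity (GB movable per hour) over the next 8 hours.
spare = [5, 5, 40, 40, 40, 10, 5, 5]
start = latest_start(100, 8, spare)   # a 100 GB flow due within 8 hours
```

Deferring rather than starting immediately is a design choice: if measured conditions diverge from the forecast, a flow that has not yet started is far easier to reschedule than one that must be paused midstream.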
This type of process requires a solid set of interfaces between the forwarding devices, the application, and a controller. This is all arguably possible without such a set of interfaces, but it would be difficult to achieve on a large scale.
To carry this one step further, any such interface should be open, in the sense that any application, either built in house or commercial, should be able to communicate with any controller, and the controller should be able to communicate with any set of network devices, regardless of the vendor. Whether this last implication applies only to the hardware—any vendor’s hardware with a single vendor’s software—or across both software and hardware is a matter of further discussion (actually political maneuvers and loud boisterous arguments) within the computer networking world.
Network security is often defined like a castle; there is an inside and an outside, as illustrated in Figure 9.3.
In Figure 9.3, there is a portion of a castle wall defense system.
• The end towers, A and E, provide coverage for the wall and gate. Defenders can shoot into the flank of any attackers from these points, discouraging attacks on the walls and gate system.
• The outer gate, B, is not the bulwark you might think—though such gates tend to look impressive in movies. In reality, the gate of a walled city was normally a system, rather than a single gate. In ancient Israel, the system had six chambers with seven gates, with the innermost gate being the most formidable. Gate B is a formidable gate, of course, but its primary purpose is to filter, rather than to completely block. If attackers can get past this gate, it is designed with a secondary gate that will drop from above—a portcullis—trapping attackers inside the first chamber, where they can be destroyed from the protection of the surrounding walls and ceilings. Breaking your enemy up into smaller groups (divide and conquer) was much of the strategy of wall systems.
• C and D are inner gates. Each chamber is a complete defense system, designed as a fall back in case the previous gate fails—the first stage of defense in depth. In a sense, gate C is only there to protect gate D from attack (to protect the integrity of the inner gate).
If an invader made it inside the outer wall, there was often an inner wall with gates, and so on, and finally a keep, or some form of temple that would serve as a complete fortress within a fortress. Such wall systems were effective in their time; 250 defenders could hold a walled city against armies of incredible sizes, necessitating long sieges to starve the defenders out. The most effective attack against a walled city was an insider attack—bribing someone in the city to leave a smaller side gate open. In the modern world, these systems seem impressive but ineffectual. Why?
• Artillery made breaching the walls much easier.
• Tunneling mechanisms allowed breaching the walls from underneath.
• Airplanes made it possible to just fly over the walls, dropping stuff from above.
Network security perimeters are built much like a walled city, as shown in Figure 9.4:
• Router F is configured primarily to do some basic filtering. This isn’t really to block most attacks, but rather to protect G, a firewall, from attack from the outside through methods designed to overflow the link, G’s input interface, or other means. This is the equivalent of gate B on the walled defense system; this outer layer takes care of some simple attacks quickly without using a lot of network resources.
• Host H may host “sacrificial” services, but it also serves as a honeypot. Any attacks against this server can be measured and quantified, which helps build better filters on Router F and better stateful inspection rules on G. Information gathered here can also indicate which internal services attackers are trying to reach, and so on. This is the equivalent of the chamber between Gates B and C in the walled defense system. Note this host can be used as a base of attack by an intruder if it is taken over, much as the chamber between the gates in a walled city. Such a host needs to be well controlled with tight local security.
• 安全设备G 可能包括无状态过滤器、有状态数据包过滤器和网络地址转换。这是“主门”,类似于城墙防御系统中的 C 门。此时网络地址转换提供了“故障关闭”保障;软件错误会导致连接丢失,而不是开放访问。
• The security appliance, G, may include stateless filters, stateful packet filters, and network address translation. This is the “primary gate,” similar to gate C in the walled defense system. Network address translation at this point provides a “fail closed” safeguard; software errors result in a loss of connectivity, rather than open access.
传统DMZ系统的缺点是什么?与城墙的城市几乎相同:
What are the weaknesses of the traditional DMZ system? Pretty much the same as a walled city:
• 拒绝服务(DoS) 攻击可以向DMZ 投掷大炮,破坏或破坏网络门系统。
• Denial of Service (DoS) attacks can throw artillery at the DMZ, breaking or breaching the network gate system.
• 通往外界的隧道提供了一个入口点,攻击者可以通过该入口点穿过墙壁而无需穿过大门。
• Tunnels to the outside world provide an entry point through which attackers can reach through the wall without passing through the gates.
• 直接针对服务器的攻击(例如针对 HTTP 服务器的脚本攻击)提供了一种“飞越”防御系统的方法。
• Direct to server attacks (such as scripting attacks against HTTP servers) provide a way to “fly over” the defense system.
• 当然,内部攻击始终存在。
• And, of course, there’s always the ever present insider attack.
As it turns out, things haven’t changed much since the Middle Ages. As the tools for attacking the network’s walls become more advanced, network security needs to move from building walls to building fluid mobile forces that can defend specific spots against a variety of attacks—just like the modern fluid mechanized force. To put it in less military terms, the network needs to move from being crunchy on the outside and chewy in the middle to being crunchy through and through.
But how do networks make the transition? One possibility is to enlist the tools available in the programmable network. Figure 9.5 illustrates.
In this network:
1. Host A sends a packet toward a service, located on Server D. For this example, assume this packet is initiating a session with the service, and the packet contains some form of login credentials to the service.
2. Router B, rather than simply forwarding this packet, notes the service has been marked secure, so it redirects the packet toward a controller.
3. The controller, E, examines the packet and determines which service the packet is destined toward. The controller then contacts the service itself (shown here), or some centralized identity service (such as a Keystone service instance in OpenStack), to verify the user’s ability to access the service.
4. The target service, or the identity service, returns a denial for the request to access the service for this particular user.
5. The controller uses this information to set up a filter blocking the user from accessing the server, D, on which the service resides, and replies to the original packet with a denial or login failure.
While this is more of an Authentication, Authorization, Access (AAA) example, the same principle holds for stateful filtering and other security measures. Programmable networks can push security policy to the edge of the network, providing a software-defined perimeter on a per service or per network topology element basis.
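The five steps above can be condensed into a short sketch of the controller-side logic. The addresses, user names, and the `identity_check` helper are all hypothetical stand-ins for the controller's real integration with the service or an identity store such as Keystone.

```python
# Hypothetical sketch of steps 2-5: Router B punts packets for secure
# services to the controller, which checks identity and either admits
# the flow or installs a blocking filter.

SECURE_SERVICES = {"10.0.0.40"}   # assume Server D's service is marked secure

def identity_check(user, service_ip):
    # Stand-in for contacting the service or a central identity store.
    allowed = {("alice", "10.0.0.40")}
    return (user, service_ip) in allowed

def handle_packet(packet, filters):
    dst, user = packet["dst"], packet["user"]
    if dst not in SECURE_SERVICES:
        return "forward"                      # ordinary traffic is not punted
    if identity_check(user, dst):             # step 3: verify access
        return "forward"
    filters.append((packet["src"], dst))      # step 5: block future attempts
    return "login-failed"                     # reply to the original packet

filters = []
result = handle_packet({"src": "10.0.1.5", "dst": "10.0.0.40",
                        "user": "mallory"}, filters)   # denied; filter installed
```

Here the filter list stands in for the edge filters the controller would push toward Router B, which is how the software-defined perimeter ends up enforced on a per-service basis.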
Interfaces are a key point in bridging the gap between traditional control planes, applications and policy implementation, and the programmable network. The interface between the network (or network devices) and applications or systems programming the network is broken into two different pieces.
The northbound interface is information individual network devices provide to controllers; this information would normally be rolled up into a larger view of the network as a whole. This interface contains four specific kinds of information:
• Capabilities: Provides information about what can be programmed or controlled. This includes metadata, or rather, information about how the control structures are organized, and how information is presented by network devices to the controller.
• Inventory: Provides information about what devices are installed where in the network, potentially including any information about physical connections.
• Topology: Provides information about the state of links connecting network devices. This may include things like the native bandwidth.
• Telemetry: Includes operational state, counters, and other information about the current network state. This includes things like the available bandwidth, queue depths, delay, and jitter.
The southbound interface is desired state pushed from the controller into the network (or network devices). There are at least three specific interfaces here:
• Forwarding Information Base (FIB): A direct interface into the actual tables used to forward packets through the switching device. This interface bypasses any internal control plane processes, statically configured forwarding information, and so on.
• Routing Information Base (RIB): A direct interface into the tables used to build the FIB. Information injected into this interface is mixed with forwarding information installed by other control plane processes, static forwarding information, and so on. This could include interfaces into protocol-specific tables, such as an internal RIB or topology table—the problem with such interfaces, however, is they interfere directly with the operation of distributed protocols that have very specific rules designed to prevent loops in the calculation of best paths. Interfacing with these tables can have much larger side effects than a single device failure or a local routing loop, by injecting information that causes the protocol to fail to converge.
• Switching Path: An interface into forwarding parameters that don’t relate to the forwarding tables, but directly impact packet handling, such as QoS.
None of these interfaces encompass what might traditionally be considered network management. All three relate directly to how packets are handled as they pass through the device, and not to things like the power level, the temperature of any installed processors, memory utilization, and so on. Figure 9.6 illustrates these interfaces.
What technologies are being developed to address the use cases described so far? This section provides an overview of four very different types of technology developed (or being developed) in the programmable network space. This overview doesn’t offer a complete picture, because the focus here is to provide a solid background for considering network programmability and complexity. The technologies that will be considered are OpenFlow, YANG, Path Computation Element Protocol (PCEP), and the Interface to the Routing System (I2RS).
OpenFlow was originally designed to facilitate the study of new control plane protocols and systems in an experimental setting. As most high speed forwarding hardware is tied to a specific vendor’s equipment, there is little room to develop and implement a new control plane to see if more efficient systems of managing the advertisement of reachability can be found. OpenFlow was developed at a time in the networking industry when large chassis-type switches were common; in a chassis switch the control plane runs on a route processor, while the actual high-speed forwarding hardware is installed and configured on line cards. This separation of functionality provides a natural point at which an external device can be connected; separating the route processor from the line cards in most chassis-based routers allows the control plane processes running on the route processor to be replaced by a different control plane deployed on an external device, a controller.
Within Figure 9.6, then, OpenFlow is a southbound interface providing forwarding information to the FIB.
Figure 9.7 illustrates the operation of OpenFlow.
In this figure:
1. Host A transmits a packet toward Server D.
2. This packet is received by Router B, which examines its local forwarding table, and discovers it doesn’t have forwarding information for this destination. As there is no local forwarding information, Router B forwards the packet to the Controller, E, for processing.
3. The Controller, E, examines the packet it has received from Router B. The controller determines the correct routing information (the process by which this information is discovered is left up to the controller, but it could be something as simple as connecting the controller to every router in the network, so it has full visibility for every destination—or the controller could run a distributed routing protocol to exchange reachability information with other controllers). Once this information is determined, including a next hop rewrite, the correct outbound interface, and so on, a flow entry is installed into Router B’s forwarding table. This creates the state necessary for Router B to forward packets in this flow.
4. The Controller, E, can also determine this traffic flow must pass through Router C, so it installs the calculated outbound interface and other information into Router C’s forwarding table, as well.
5. Future packets Host A sends toward Server D will now be forwarded through Routers B and C based on this cached flow entry information.
Note
This packet processing cycle should sound familiar to longtime network engineers. It is almost identical to the “fast cache” processing of older Cisco routers, for instance; see Inside Cisco IOS Software Architecture for more information.
A key point to remember about OpenFlow is it was originally designed to carry flow entries with full information from the header of the forwarded packet, including the source IP address, the destination IP address, the IP protocol number, the source port, and the destination port. Flow labels can contain specific information that will implement forwarding for each specific flow, or they can contain information for a group of flows through wildcards. For instance, a standard IP route can be emulated using flow entries with just the destination IP address information.
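The wildcard behavior can be sketched with a match function in which a `None` field matches anything, so an entry carrying only a destination address behaves like a standard IP route. The field names and actions here are illustrative.

```python
FIELDS = ("src_ip", "dst_ip", "proto", "src_port", "dst_port")

def matches(entry, packet):
    # None acts as a wildcard; any other value must match exactly.
    return all(entry[f] is None or entry[f] == packet[f] for f in FIELDS)

def lookup(flow_table, packet):
    candidates = [e for e in flow_table if matches(e, packet)]
    if not candidates:
        return None   # no flow entry: punt to the controller
    # Prefer the most specific entry (fewest wildcarded fields).
    return min(candidates,
               key=lambda e: sum(e[f] is None for f in FIELDS))["action"]

route_like = {"src_ip": None, "dst_ip": "10.0.0.40", "proto": None,
              "src_port": None, "dst_port": None, "action": "port2"}
exact_flow = {"src_ip": "10.0.1.5", "dst_ip": "10.0.0.40", "proto": 6,
              "src_port": 4242, "dst_port": 80, "action": "port3"}
pkt = {"src_ip": "10.0.1.5", "dst_ip": "10.0.0.40", "proto": 6,
       "src_port": 4242, "dst_port": 80}
action = lookup([route_like, exact_flow], pkt)   # the exact entry wins
```

The `route_like` entry, with everything but the destination wildcarded, is exactly the emulated IP route described above; any other packet toward 10.0.0.40 falls through to it.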
OpenFlow has been implemented, in varying degrees, by a wide array of vendors.
See Foundations of Modern Networking: SDN, NFV, QoS, IoT, and Cloud by William Stallings (Addison-Wesley Professional, 2015) for a more complete description of OpenFlow.
In most protocols, the data model is bound into the protocol itself. For instance, the IS-IS protocol carries reachability information in a set of Type-Length-Values (TLVs), bound within a specific packet format any IS-IS process running on any device can understand. Hence, the format of the information carried within the protocol and the transportation mechanism for the information are combined into one object, or one thing. YANG is different in that it is actually a modeling language designed to describe the forwarding and other state in network devices.
Note
Data models and information models are closely related but different things. An information model describes the flow of information as well as the structure of the information and the interaction between the different processes that handle the information. A data model is focused on the structure of the information itself, including the relationships between different structures that hold information. YANG, as a modeling language, is a data model rather than an information model, as it focuses on the structure of information. An information model based on YANG could potentially include the transport of data structured using a YANG model, as well as information about how the data is used to achieve specific states in the network.
Note
The YANG specification is published as RFC60202 by the Internet Engineering Task Force (IETF).
2. M. Bjorklund, ed., “YANG - A Data Modeling Language for the Network Configuration Protocol (NETCONF)” (IETF, October 2010), accessed September 24, 2015, https://datatracker.ietf.org/doc/rfc6020/.
YANG is a modular language expressed as an eXtensible Markup Language (XML) tree structure—a more familiar subset of XML is HTML, used to carry the instructions browsers render as web pages. As a language, YANG doesn’t really specify any information about network devices; it just provides a framework for expressing information about network devices, much like a set of grammar rules for any natural language.
Note
While HTML is a rough subset of XML, HTML was actually developed before XML. The success of HTML in the field led to the development of a superset markup system that could be used for a wider array of uses than HTML, including the general structuring of information such as YANG. In a sense, HTML is the “father” of XML, which, in turn, has spawned a number of peers to HTML. YANG can be considered a peer of HTML in this sense, as it’s a subset or more specific definition of XML used for a specific case.
A number of YANG models are being developed for network protocols and devices at the time of this writing. For instance, a snippet of a model for structuring the notification of a link failure follows.3
3. M Bjorklund, “YANG-A Data Modeling Language for the Network Configuration Protocol (NETCONF),” YANG Central, n.p., last modified October 2010, accessed June 14, 2015, http://www.yang-central.org/twiki/pub/Main/YangDocuments/rfc6020.html#rfc.section.4.2.2.5.
notification link-failure {
description "A link failure has been detected";
leaf if-name {
type leafref {
path "/interface/name";
}
}
leaf if-admin-status {
type admin-status;
}
leaf if-oper-status {
type oper-status;
}
}
<notification
xmlns="urn:ietf:params:netconf:capability:notification:1.0">
<eventTime>2007-09-01T10:00:00Z</eventTime>
<link-failure xmlns="http://acme.example.com/system">
<if-name>so-1/2/3.0</if-name>
<if-admin-status>up</if-admin-status>
<if-oper-status>down</if-oper-status>
</link-failure>
</notification>
This first code snippet shows the information in the model as a set of declarations (such as might be appropriate for building a piece of software around handling this information). The second shows the same information expressed in XML, with markups to indicate what type of information is contained in each section. Most user interfaces will show models in the XML format for human readability, though they can also retrieve the declaration type representation.
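Because the second form is plain XML, it can be consumed with any standard XML parser. This sketch uses Python's standard library to pull the leaf values out of the link-failure notification shown above.

```python
import xml.etree.ElementTree as ET

DOC = """<notification
xmlns="urn:ietf:params:netconf:capability:notification:1.0">
<eventTime>2007-09-01T10:00:00Z</eventTime>
<link-failure xmlns="http://acme.example.com/system">
<if-name>so-1/2/3.0</if-name>
<if-admin-status>up</if-admin-status>
<if-oper-status>down</if-oper-status>
</link-failure>
</notification>"""

ACME = "{http://acme.example.com/system}"   # namespace from the example

def parse_link_failure(xml_text):
    root = ET.fromstring(xml_text)
    failure = root.find(ACME + "link-failure")
    # Strip the namespace prefix to recover the YANG leaf names.
    return {child.tag.replace(ACME, ""): child.text for child in failure}

event = parse_link_failure(DOC)   # {'if-name': 'so-1/2/3.0', ...}
```

The leaf names the parser recovers (`if-name`, `if-admin-status`, `if-oper-status`) are exactly the leaves declared in the YANG module, which is the point of the model: both producer and consumer agree on the structure ahead of time.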
Note
For more complete examples of YANG models, see RFC6241.4
4. R. Enns, “Network Configuration Protocol (NETCONF)” (IETF, June 2011), accessed September 24, 2015, https://www.rfc-editor.org/rfc/rfc6241.txt.
If every vendor implements interfaces that can express the state of each network element in terms of a standard YANG model, then every device in the network could be programmed through a single interface. This is a “unicorn dream,” of course, but the IETF and other open standards organizations are working toward this goal by creating models for all open standards based protocols, and common models for generic network devices.
With the modeling language and actual models in place, the next question that must be answered is—how is this information carried through the network? While there are a variety of ways, two are of special note, as they are being defined specifically to carry YANG information.
• NETCONF is a Remote Procedure Call-based protocol designed specifically to carry YANG encoded information in an XML format. NETCONF allows an application to retrieve, manipulate, and update device configuration. An access control YANG model is also provided with NETCONF.
• RESTCONF is a variation on NETCONF. It is still specifically designed to carry YANG encoded information in an XML format, but it only provides a REpresentational State Transfer (REST) interface. A REST interface, in this case, means there is no state held on the router, such as what the external application has asked for in the past, the previous command issued by the external application, and so on. A REST interface transfers the complexity of managing current state out of the device being controlled and into the controlling application, making it easier to implement a REST interface on the controlled device; this can be ideal for many types of network devices and applications.
YANG, NETCONF, and RESTCONF were originally developed to provide northbound and southbound interfaces into device capabilities and management, and to provide a northbound interface for telemetry and topology.
Note
YANG is designed to be transport independent; YANG models are not supposed to rely on the formatting or capabilities of either NETCONF or RESTCONF. Conversely, NETCONF and RESTCONF could, in theory, carry information modeled using some other language than YANG. While it’s probable, however, that YANG will be widely used outside RESTCONF or NETCONF, it’s not likely that the two transport protocols will ever be used for anything other than transporting YANG formatted information.
PCEP was originally developed to connect:
• A Path Computation Element (PCE), which can calculate paths through the network based on specific constraints, such as available bandwidth, to a —
• A Path Computation Client (PCC), which can receive the paths computed by a PCE and forward traffic based on them
PCEP was originally designed to support traffic engineering for flows that cross Service Provider (SP) boundaries (called Inter-AS traffic flows). Documented in IETF RFC 4655, the general idea was to allow one SP to set specific parameters for flows being passed to another SP so the second SP could calculate an optimal path for the traffic without allowing the first SP to actually control their network’s forwarding policies. Figure 9.8 illustrates this.
In this illustration:
1. Provider Y’s controller (B) sends information about the quality of service requirements for a customer’s virtual topology to Provider Z’s controller (K). This information may be manually configured, automatically configured as part of a provisioning system, or signaled in some other way. Assume, for this example, that the customer’s traffic, originating from Host A, will require a minimum amount of bandwidth.
2. Provider Z’s controller (K) will compute a path through the network that will meet the constraints required by the policy for this specific traffic or tunnel. This computation would normally involve running a constrained SPF computation, but it may be calculated using any method—this is an implementation detail within the controller software from the protocol’s point of view. Provider Z’s controller will then use PCEP to configure Routers D, F, and G with the correct information to ensure the flow passes along a path with the required qualities.
3. Traffic transmitted by Host A will be carried along the indicated path to Server H.
How is the traffic forwarded through the path requested (the path through [D,F,G]) rather than along what appears to be the shortest path (the path through [D,G])? By building an MPLS tunnel through the network—in fact, PCEP is just installing labels along the path to make certain that the traffic inserted into the tunnel headend at Router D is carried through the tunnel through Routers F and G to Server H.
Note
This example assumes one way of setting up the MPLS tunnels to carry the traffic along the engineered path. There are many others—the tunnel may begin in one provider’s network and terminate in the other, or the labels required to tunnel traffic from Router D to G might already exist, so Controller K only needs to signal Router D at the headend to effect the policy required. In fact, there is probably an almost infinite number of ways to set these tunnels up, depending on how the provider has configured their network, the relationship between the two providers, and other issues.
If you think OpenFlow and PCEP are similar in operation and intent, you’re right. They’re both southbound interfaces that transmit forwarding information from an external controller to the forwarding plane. Some differences between the two technologies, however, include:
• OpenFlow signals an entire 5 or 7 tuple to build the flow table on the forwarding devices, and anticipates that forwarding will take place based on the packet header as it’s transmitted by the originating host or device.
• PCEP signals an MPLS label stack, and anticipates forwarding will take place based on the MPLS label stack.
• OpenFlow is designed to replace the entire control plane, or rather to allow the controller to operate like a route processor in a large chassis-based system.
• PCEP is designed to augment the existing distributed control plane.
While OpenFlow and PCEP are similar in where they interface with forwarding devices, they are different in their original intent and what they signal. The operation of PCEP—signaling MPLS label stacks to direct traffic along specific paths in the network—means it can be used as a more generic southbound interface. PCEP can be an effective tool to control the flow of traffic through a network with MPLS transport for any reason, rather than just for interdomain traffic engineering. That PCEP is widely deployed, augments the existing distributed routing system (rather than replacing it), and works with a widely available and understood tunneling mechanism (MPLS) means PCEP is a good candidate for a general purpose southbound interface in many networks.
PCEP, however, can only install MPLS labels, and not layer 2 forwarding information. In this sense, OpenFlow is more flexible than PCEP, in that it can be used to install either MPLS labels or layer 2 forwarding information. The choice between the two will depend on what is already deployed, what the operations staff are more comfortable with, future architectural plans, and what the already installed equipment offers the best support for.
The I2RS charter on the IETF website says:
I2RS facilitates real-time or event-driven interaction with the routing system through a collection of protocol-based control or management interfaces. These allow information, policies, and operational parameters to be injected into and retrieved (as read or by notification) from the routing system while retaining data consistency and coherency across the routers and routing infrastructure, and among multiple interactions with the routing system. The I2RS interfaces will co-exist with existing configuration and management systems and interfaces.5
5. “Interface to the Routing System,” IETF, n.p., accessed June 15, 2015, https://datatracker.ietf.org/wg/i2rs/charter/.
I2RS was not started to compete with OpenFlow, PCEP, and other southbound interfaces. Rather, it was originally started to complement those efforts by providing:
• A Layer 3 southbound interface into the RIB that would overlay and support existing distributed routing protocols
• A Layer 3 northbound interface into the RIB and routing processes that would make the topology, telemetry, inventory, and other information about the network already known by existing distributed routing protocols accessible
Figure 9.9 illustrates one way to model I2RS.
应用程序可以通过 I2RS 客户端与设备(或者更确切地说,网络)进行通信,I2RS 客户端提供设备功能、数据编组、发现、传输和其他服务的通用库。例如,I2RS 客户端可能会从跨多个设备的路由进程收集信息,并提供进入网络统一拓扑视图的接口。I2RS 客户端还可以为保存特定数据的设备提供统一的数据模型,例如 RIB 条目或 BGP 表条目,以不同的方式。最终,I2RS 的目标是让所有设备使用一组通用的数据模型,但在现实世界中这可能永远不可能;客户端提供了一个可以在数据模型之间转换信息而不影响应用程序设计的点。
Applications may either communicate with devices—or rather, the network—through I2RS clients, which provide a common library of device capability, data marshaling, discovery, transport, and other services. An I2RS client, for instance, might gather information from the routing processes across multiple devices, and provide an interface into a unified topological view of the network. I2RS clients can also provide a unified data model for devices that hold specific pieces of data, such as a RIB entry or a BGP table entry, in different ways. Ultimately, the goal of I2RS is to have all devices use a common set of data models, but in the real world this might not ever be possible; the client provides a point at which information can be transformed between data models without impacting application design.
A useful way to think of I2RS within a more traditional context is to consider Application 3 in Figure 9.9 as "just another routing process" that happens to run on generic compute and storage off Router Z. This might be a route server, or a process running in a standard container or virtual machine in a data center. Either way, the interface between the on-box routing processes and the RIB can be used by an off-box routing process to interact with the other routing protocols and routing information sources on a router, and thereby appear to be on box.
Some key points of the I2RS architecture, as described in An Architecture for the Interface to the Routing System (draft-ietf-i2rs-architecture) are:
• Multiple Simultaneous Asynchronous Operations: Multiple clients should be able to query and set routes and other information without interfering with one another. This drives the I2RS requirements for near real-time event processing and REST-based operations. If state must be held by the agent across operations from multiple clients (the agent process in Router Z in Figure 9.8 could receive alternating events from two different clients), the order in which these operations are received must not be important. Different event orders, in other words, should not result in different state in the local RIB of any device. This is important to preserve the loop-free nature of routing.
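One way to satisfy the order-independence requirement above is to key each piece of installed state by both prefix and client, so operations from different clients never overwrite one another and therefore commute. The sketch below is a hypothetical illustration of that idea, not the actual I2RS mechanism:

```python
# Sketch: order-independent RIB updates from multiple I2RS clients.
# Routes are keyed by (prefix, client), so operations from different
# clients commute: any arrival order yields the same final RIB state.
# Operation shapes and field names are hypothetical.

def apply_op(rib, op):
    """Apply one add/withdraw operation to the RIB in place."""
    key = (op["prefix"], op["client"])
    if op["action"] == "add":
        rib[key] = op["next_hop"]
    elif op["action"] == "withdraw":
        rib.pop(key, None)
    return rib

ops_from_client1 = [{"action": "add", "client": "c1",
                     "prefix": "10.1.0.0/16", "next_hop": "192.0.2.1"}]
ops_from_client2 = [{"action": "add", "client": "c2",
                     "prefix": "10.1.0.0/16", "next_hop": "192.0.2.2"}]

# Apply the same operations in two different interleavings.
rib_order_a, rib_order_b = {}, {}
for op in ops_from_client1 + ops_from_client2:
    apply_op(rib_order_a, op)
for op in ops_from_client2 + ops_from_client1:
    apply_op(rib_order_b, op)
```

Both interleavings produce identical RIB state, which is the property the requirement demands; resolving which of the two entries actually wins in forwarding would be a separate, deterministic policy step.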
• Asynchronous, Filtered Events: Clients should be able to receive information about changes in the RIB or routing processes in managed devices in near real time. To prevent driving unneeded information across the network, clients must be able to install filters on the information being driven from the agents they are monitoring.
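A client-installed event filter of the kind described above might look like the following sketch; the event structure and field names are hypothetical.

```python
# Sketch: a client installs a filter at an agent so that only matching
# RIB-change events are pushed across the network.
# Event structure and field names are hypothetical.

def make_prefix_filter(wanted_prefix):
    """Build a predicate that passes only RIB events for one prefix."""
    def matches(event):
        return (event["type"] == "rib-change"
                and event["prefix"] == wanted_prefix)
    return matches

def filtered_stream(events, installed_filter):
    """Deliver only the events the client asked for."""
    return [e for e in events if installed_filter(e)]

agent_events = [
    {"type": "rib-change", "prefix": "10.0.0.0/8", "change": "add"},
    {"type": "rib-change", "prefix": "172.16.0.0/12", "change": "add"},
    {"type": "adjacency", "neighbor": "R2", "change": "down"},
]

delivered = filtered_stream(agent_events,
                            make_prefix_filter("10.0.0.0/8"))
```

Only one of the three events crosses the network, which is the point: filtering at the agent keeps unneeded state off the wire.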
• Ephemeral State: I2RS interacts with two sets of information that are normally built through distributed routing protocols—the RIB, internal BGP, IS-IS, OSPF, and other routing protocol tables. Network engineers do not expect this information to survive a reboot—in fact, a route in the routing table surviving a reboot is generally considered a bad thing (unless it’s preserved through some protocol mechanism such as graceful restart), as stale routing information may not match the current network topology. I2RS, then, never installs any permanent information in network devices. If state must persist through a device reboot, the process of reinstalling must be managed by an I2RS agent. This rules out using I2RS for configuring a device, or managing the configuration of a device.
While the decision isn’t final at the time of this writing, it appears I2RS will resolve to using YANG models for the state of the RIB and routing protocol tables, combined with a RESTCONF interface into that information. The models are still in development, and it’s uncertain that RESTCONF will be able to support the near real-time requirements of I2RS.
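In outline, a RESTCONF read of RIB state would be an HTTP GET against a YANG-modeled resource path. The sketch below only constructs such a request, without contacting any device; the module and node names are illustrative placeholders, not the still-in-development I2RS models the text refers to.

```python
# Sketch: constructing a RESTCONF GET for RIB state. No request is
# actually sent, and the YANG module/path names are illustrative
# only -- not the final I2RS data models discussed above.

def restconf_rib_request(host, routing_instance):
    """Build the URL and headers for a RESTCONF read of a RIB."""
    path = (f"/restconf/data/ietf-routing:routing/"
            f"ribs/rib={routing_instance}")
    return {
        "url": f"https://{host}{path}",
        # RESTCONF responses are YANG-modeled JSON (or XML)
        "headers": {"Accept": "application/yang-data+json"},
    }

req = restconf_rib_request("192.0.2.10", "default")
```

Whether a request/response interface like this can meet I2RS's near real-time requirements is exactly the open question the paragraph above raises.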
The concept and use cases driving the programmable network aren't really new in any meaningful sense, but the current open API, open protocol, data-model-focused idea of a network API is breathing fresh air into the world of network engineering. Moving from manual configuration to expressing a policy that is then managed and implemented through a series of machine-to-machine interfaces is a complete revolution in the way network control planes are perceived and understood.
How does this world fare against complexity? The simplest answer is that because complexity is apparently buried under a layer of APIs, simple interfaces, and new ways of looking at expressing policy and state, networks will become dramatically simpler. This doesn’t match the experience or theory of complexity in the real world, however—complexity can be moved, but it can’t be eliminated. The next chapter will examine these questions in more detail.
The allure of the programmable network, as outlined in Chapter 9, “Programmable Networks,” is two-fold: the reduction in complexity enabled by centralized routing decisions, and the ability to allow applications and orchestration to interact with the control plane. How do these promises pan out when considered in light of systemic network complexity? This chapter examines programmable networks in terms of network complexity—where does network programmability really decrease complexity, and where does it add complexity? The primary goal of this chapter is to consider a set of tradeoffs that will help guide the questions to ask when considering deploying programmable network technologies.
A short overview of the subsidiarity principle will frame the following discussion; while this principle is actually a governance construct, it is echoed in many principles and concepts used in the world of network engineering. Following this, four specific areas will be examined to consider the tradeoffs programmable networks bring to each one: policy management, control plane failure domains, the separation of the control and data planes, and the impact of application-based control on the concept of interaction surfaces. These four do not represent every possible area of investigation, nor is every one of them applicable to every type of network programmability, but they do provide a broad cross section of thought around tradeoffs in the realm of complexity.
Each of these will be considered in light of the model developed early in this book, and used throughout—what is the impact on the state, the speed, and the surfaces? And what is the tradeoff for any given solution in terms of optimal forwarding and use of network resources?
Virtually every network engineer working in the area of protocol design and implementation has heard of the end-to-end principle. Saltzer first articulated this principle in a paper published in 1984:
In a system that includes communications, one usually draws a modular boundary around the communication subsystem and defines a firm interface between it and the rest of the system. When doing so, it becomes apparent that there is a list of functions each of which might be implemented in any of several ways: by the communication subsystem, by its client, as a joint venture, or perhaps redundantly, each doing its own version. In reasoning about this choice, the requirements of the application provide the basis for the following class of arguments: The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the endpoints of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible, and moreover, produces a performance penalty for all clients of the communication system. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.) We call this line of reasoning against low-level function implementation the end-to-end argument.1
1. J. H. Saltzer, D. P. Reed, and D. D. Clark, “End-to-End Arguments in System Design,” ACM Transactions on Computer Systems 2, no. 4 (1984): 277–278.
The end-to-end principle might appear unique to the computer network world, but it's actually a domain specific restatement of a larger principle. In social and governmental circles, the same concept is called subsidiarity, which means to solve local problems with local control where possible—or to move the control and decision point as close to the problem as possible. The general idea is that the line of communication between the problem and the controller is a naturally congested space that lends itself to aggregation, but the more aggregated information is, the less effective any given solution will be in solving the problem at hand. This might be called "local information, local control."
Within the context of "local information, local control," when considering how and where to place control over a specific process, engineers should ask:
• What device on the network has the most accurate information about the state of the process?
• What information must be transported, at what frequency, and how, if control of the process is moved off the device with the state?
Considering this from the perspective of the end-to-end principle—in terms of error and flow control, which device(s) in the network have the most accurate and up-to-date view of the state of any specific stream of data flowing through the network? The transmitting and receiving host. Hence, the hosts actually transmitting and receiving any given data stream should have the most control over retransmitting and controlling the flow of traffic. While routers, switches, and middle boxes can try to infer information from the flow state as it passes through, there is no certain way for such devices to know all of the state for any particular flow.
This same principle applies in the control plane as well—devices that are directly connected to the links making up the network topology (including the reachable destination) are more likely to know about, and react to, changes in that state more quickly.
While this does tend to push intelligence to the edge of the network, it doesn’t necessarily mean every piece of state needs to be distributed as much as possible. Rather, it means each piece of state needs to be carefully examined to put the control of that state in the most logical place—and the most logical place is often going to be where the information originates and is the most complete. By reframing the end-to-end principle in terms of keeping state close to the origin of the state itself, or rather keeping control close to the most complete state available, the end-to-end principle can be applied across the system in both protocol and network design.
Policy is probably one of the most difficult issues to manage in network engineering. Before diving into a long discussion over the tradeoffs between complexity and policy, however, one question needs to be answered: what is policy? Perhaps the best way to address this question is by looking at some examples, starting from business requirements and moving through to the practical impact on the network design.
• Flexible Design: Businesses want a network design that will “roll with the punches.” If the business expands, contracts, or changes direction, they don’t want to perform a forklift upgrade across their systems. How is this accomplished at a design level? Using scale out design principles—modularized networks and workloads that can be expanded as needed and can redeploy resources without a major redesign. One common solution for building a flexible network is through modularity, which in turn means well-defined, interchangeable modules within the network. A second solution is to pull the intelligence out of the network as much as possible, so the network becomes useful for a wide array of tasks, rather than being narrowly focused on a small set of problems.
• High Return on Investment: This overlaps somewhat with flexible design, but it also stands as a separate concept in many network designs. For instance, there is little point in purchasing a new set of links between two locations in the network if there is excess capacity available, just not along the shortest path. To improve network utilization, many operators use various forms of traffic engineering, placing traffic on less than optimal paths from a shortest path perspective to achieve higher overall network utilization. ROI involves equipment cost, as well. Is it better to purchase smaller devices (such as a set of 1RU switches) that can be easily replaced, or a larger multiblade chassis system?
• Business Continuity: Businesses cannot withstand downtime—time, in the business application world, is definitely money. How is this accomplished at the design level? Modularizing the network to divide complexity from complexity, to break up failure domains, is the primary mechanism used to provide business continuity. Defense against Denial of Service (DoS) attacks also plays a major role in business continuity, again requiring points where traffic can be measured, and policies instituted.
• Secure Information: Businesses rely on secure information for operations, strategic advantage, and to protect their trust relationship with their customers. How does information security translate into network design? Primarily through containing data through virtualization, access policy, and other techniques that allow administrators to manage who and what can access specific information.
• Application Support: Many applications have specific requirements around jitter, delay, and bandwidth in the underlying transport. To achieve these goals, many networks deploy a combination of undersubscription (overbuilding the available bandwidth), traffic engineering (moving traffic from heavily to lightly used links), and quality of service (modifying the way packets are queued, queues scheduled, and congestion managed).
What are the common elements in these examples? In the data plane, permitting packets to pass through the network, and the handling of those packets passing through the network, are the two common elements. In the data plane, then, most (or all) policy resolves to the per hop handling, or the per hop behaviors, of actually forwarding traffic.
The control plane is a bit more interesting. What is common between modular design, breaking up failure domains, traffic engineering, and controlling data access? In all these cases, traffic is potentially moved from the shortest path between two points in the network to some longer path to meet some policy goal. Figure 10.1 illustrates this concept.
• The path [B,F] would be the shortest path (based on the hop count alone, assumed throughout this example) between the two servers A and G, but it was not installed, in order to create a module boundary, breaking a single network topology into two failure domains and creating two separate network modules.
• The path [B,D,F] is the shortest path actually available, but it is overloaded, or close to its maximum utilization.
• The path [B,C,E,F] is the actual path used to carry traffic between the two servers A and G. This path is one hop longer than the actual shortest path, and two hops longer than the shortest potential path.
In this illustration, optimal traffic flow has been traded off to implement policy, which then supports some specific business requirement.
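The tradeoff in this illustration can be made concrete with a hop-count comparison. The adjacency list below is reconstructed from the paths named in the bullets above, so it is an approximation of the figure's topology (in particular, the link [B,F] is not installed):

```python
from collections import deque

# Sketch: hop-count comparison for the paths discussed above.
# Topology is reconstructed from the named paths; link [B,F] is
# deliberately absent, modeling the module boundary.

graph = {
    "B": ["C", "D"],
    "C": ["B", "E"],
    "D": ["B", "F"],
    "E": ["C", "F"],
    "F": ["D", "E"],
}

def shortest_path(graph, src, dst):
    """Breadth-first search: a fewest-hops path from src to dst."""
    frontier = deque([[src]])
    seen = {src}
    while frontier:
        path = frontier.popleft()
        if path[-1] == dst:
            return path
        for nbr in graph[path[-1]]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append(path + [nbr])
    return None

installed_shortest = shortest_path(graph, "B", "F")  # the overloaded path
policy_path = ["B", "C", "E", "F"]                   # path chosen by policy
```

The search returns [B, D, F] (two hops), while the path actually used by policy is one hop longer, which is the "stretch" traded for the policy goal.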
A simple rule of thumb is: any time policy is configured in a network, there is a high likelihood that traffic will take a less than optimal path between some pair of destinations. Reversing this, policy can be described as:
• Any mechanism that moves (or potentially moves) traffic off the lowest cost path in the network to meet some goal deemed more important than simply following the lowest cost path.
• Any mechanism that prevents traffic from entering the network, or blocks a particular flow from passing through the network.
Note
The Control Plane State versus Stretch section in Chapter 6, “Managing Design Complexity,” provides more illustrations of policy versus the optimal path, as well as discussing the relationship between modularity, aggregation, and other design elements in relation to control plane complexity.
Chapter 4, "Operational Complexity," poses a problem about policy dispersion versus network complexity in the section, "Policy Dispersion versus Optimal Traffic Handling." That section noted that the closer policy is deployed to the edge of the network (or rather, to the source of the set of flows to which the policy must be applied), the more optimal the use of network resources, and the greater the protection against security breaches and various types of attacks.
The solution proposed to this problem was automated deployment of the policy through the network. Several points were noted in relation to this proposed solution, including brittleness.
Machine-based systems can react more consistently, but their consistency is a bad thing as well as good. Two events that appear to be the same from one set of measurements might actually be far different, for instance; attackers can learn the pattern of reaction, and change their attacks to take advantage of them. An unforeseen combination of reactions to multiple events occurring at the same time can cause failure modes software cannot resolve.
Programmable networks can provide a more complete solution to these problems by continuously monitoring multiple interlocking systems and encompassing more complex and nuanced reactions. For instance, a programmable network may be able to modify an application’s behavior in response to network conditions, rather than just modifying the configuration of the network, or some set of links or forwarding devices. Through an application level interface, a programmable network can also discover exactly what a particular application would like to do, rather than embedding heuristics into the network that attempt to discover this information.
Programmable networks can also provide a larger amount of continuous state than a traditional management system through their closer to real-time interfaces into the state of the forwarding plane. By interfacing either at the routing table or forwarding table, controllers in a programmable network can recognize and adapt to network conditions more quickly, and with less complexity, than a management system that primarily relies on interface and device level state.
There are tradeoffs, of course. The more finely controlled the network is managed to achieve optimal control on a near real-time basis, the more state the network must manage. Figure 10.2 illustrates the programmable network state.
Assume traffic originating from Server A and destined to Server H—for some policy reason—should flow along the path [B,C,D,F,G], rather than along any other path through this network, and that each hop shown has a cost of 1. Using traditional network configuration and destination-based routing:
• Router B must be configured to examine traffic for any packets sourced from Server A and destined to Server H, and redirect such traffic toward Router C, rather than along the shortest path through Router E.
• Router C must be configured to examine traffic for any packets sourced from Server A and destined to Server H, and redirect such traffic toward Router D, rather than along the shortest path through Router E.
• Router D must be configured to examine traffic for any packets sourced from Server A and destined to Server H, and redirect such traffic toward Router F, rather than along the shortest path through Router E.
Each of these configurations represents policy that is dispersed, either manually or through some automation system, through the network. The policy is, in effect, in the “control plane in the mind of the network operator,” rather than in the actual programmable control plane used to manage the network. In a programmable network, this information would be captured in the actual control plane of the network, increasing the amount of state being carried through the network.
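The per-hop configurations listed above follow a mechanical pattern, which is what makes them a candidate for capture in a programmable control plane rather than in "the mind of the network operator." A hypothetical sketch of deriving the per-device rules from a single path-level policy:

```python
# Sketch: deriving the per-hop redirect rules listed above from one
# path-level policy. Device names match the example; the rule format
# itself is hypothetical.

def per_hop_rules(policy):
    """For each router on the path, emit a match/redirect rule that
    steers the flow toward the next hop on the policy path."""
    hops = policy["path"]
    rules = {}
    for here, next_hop in zip(hops, hops[1:]):
        rules[here] = {
            "match": {"src": policy["src"], "dst": policy["dst"]},
            "action": {"redirect_to": next_hop},
        }
    return rules

policy = {"src": "Server A", "dst": "Server H",
          "path": ["B", "C", "D", "F", "G"]}
rules = per_hop_rules(policy)
```

Every hop except the last gets one rule, matching the three configuration bullets above (plus the final forwarding hop); the policy now lives in one structure instead of being dispersed across device configurations.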
State is not the only complexity measure impacted, though. Once this policy is captured in the control plane through a programmatic interface, it will be changeable more often than through manual configuration. Thus, the drive toward programmable networks also increases the speed at which information must be disseminated through the control plane (in reaction to changes in policy).
Note
The policy implemented in this example to carry traffic along a less than optimal (metric wise) path through the network could be implemented in any number of ways, including ingressing the traffic into a tunnel or LSP.
Finally, the programmable interface is an increase in the breadth and depth of the interaction surfaces between various devices in the network; this is covered in the section "Surface and the Programmable Network," which follows.
One of the problems with distributing policy throughout a network is the large number of control and management interfaces used by the wide array of forwarding devices. A device or service that specializes in stateful packet filtering will have far different configuration options and capabilities than a device that is primarily designed to forward packets. A programmable interface would allow an intelligent controller to query each device in the network and determine where and how to apply any given policy in an optimal way without regard to the user interface provided at each device. Even if the programming API is inconsistent across devices, producing translators for a machine-readable interface is much different than creating "screen scrapers" to discover and manage state.
Programmability can also resolve a second source of complexity by converting many complex policies directly into control plane state at the controller, rather than through the local interface of each device. Returning to Figure 10.2, assume:
• Router B is a device manufactured by one vendor, and uses Policy-Based Routing to implement the policy discussed. This entails configuring a set of policies, wrapping those policies in a set of forwarding rules, and then applying the policies to the relevant interfaces on the device.
• Router C is a device manufactured by a second vendor, and uses Filter-Based Forwarding to implement the policy discussed. This entails configuring a set of policies, wrapping those policies in a set of forwarding rules, and then applying the policies to the relevant interfaces on the device.
• Router D is a switch that must be configured to examine the Layer 3 information, rather than Layer 2 information, carried in the packet to perform the operations required.
• Router E is actually a stateful packet filter; source-based rules for this device must be configured through security policies.
Each of these devices requires a different interface (and set of logical constructs) to achieve the desired outcome. If each of them had a programmable interface, however, a single piece of software, running on one or more controllers, could distill the policy into local forwarding rules for each device, and install the forwarding rules required on each one. This, in effect, centralizes the logic of the policy while distributing the result of the policy by using forwarding plane proxies in place of the policy itself on each device in the path. Each of the policies involved interacts in one place—on the controller—rather than among devices.
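The controller role just described, one common policy distilled through per-device translators, can be sketched as follows. The vendor mechanisms named mirror the bullets above, but the output formats are invented for illustration; only the shape of the idea matters.

```python
# Sketch: one centralized policy rendered into per-device forwarding
# state through vendor-specific translators. All output syntax here
# is invented for illustration, not any real vendor CLI.

COMMON_RULE = {"src": "Server A", "dst": "Server H", "next_hop": None}

def render_pbr(rule):        # vendor 1: Policy-Based Routing
    return (f"policy-route match src {rule['src']} dst {rule['dst']} "
            f"set next-hop {rule['next_hop']}")

def render_fbf(rule):        # vendor 2: Filter-Based Forwarding
    return (f"filter from {rule['src']} to {rule['dst']} "
            f"then next-hop {rule['next_hop']}")

TRANSLATORS = {"Router B": render_pbr, "Router C": render_fbf}

def push_policy(path_next_hops):
    """Distill the common rule per device and render it natively."""
    configs = {}
    for device, next_hop in path_next_hops.items():
        rule = dict(COMMON_RULE, next_hop=next_hop)
        configs[device] = TRANSLATORS[device](rule)
    return configs

configs = push_policy({"Router B": "Router C", "Router C": "Router D"})
```

The policy logic exists once, on the controller; only the rendered per-hop proxies differ per device, which is the centralize-logic, distribute-result pattern described above.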
Once again, this type of operation places more state into the control plane. The policy that was once configured locally on each device is now captured within the control plane, albeit in a simplified form. This also increases the speed at which the control plane must disperse information among devices, as the operator is interfacing directly with the forwarding tables in each device, rather than with the slower, management-oriented, human-readable user interface on each device.
This example also shows an increase in the interaction surface; each device must now interact with a controller that provides a version of the policy distilled for use in the various forwarding tables, rather than acting independently based on local configuration. At the same time, however, this interaction surface moves from “the mind of the network operator” into the controller, making it more manageable.
In the field of economics, there is a concept called moral hazard, which is defined as:
In economics, moral hazard occurs when one person takes more risks because someone else bears the burden of those risks. . . . Moral hazard occurs under a type of information asymmetry where the risk-taking party to a transaction knows more about its intentions than the party paying the consequences of the risk.2
2. “Moral Hazard,” Reference, Wikipedia, n.p., last modified June 6, 2015, accessed June 20, 2015, https://en.wikipedia.org/wiki/Moral_hazard.
A similar hazard exists when abstracting or simplifying complexity in the realm of network policy. Once policy is centralized into a single set of code, there’s no specific reason to maintain discipline in building the forwarding plane policy, or to consider alternatives to installing new per hop behaviors to solve any problem at hand. As an example, consider the case of early EIGRP deployments. Because many engineers believed EIGRP could handle any network topology, they designed EIGRP networks with no aggregation and no failure domains. Much the same happens in some large-scale BGP deployments; the protocol appears robust enough to handle anything, so why bother with failure domains?
Centralized policy, particularly in the form of the programmable network, can result in the same sort of sloppiness; human nature hasn’t changed much since EIGRP was invented and first deployed, after all. When the easiest tool to use is the hammer, every problem quickly becomes a nail.
The preceding sections focused primarily on state and speed in the programmable network, but what about surface? Figure 10.3 will be used to examine this piece of the complexity puzzle.
When using a distributed control plane and a management system:
• Surface 1, between the originating host and the first hop router, is primarily a function of any information Host A places in the packet. For instance, the source address, the destination address, the type of service, and other factors actually carry information between the host and the router, implying an interaction between these devices, and therefore a surface. This interaction surface, however, can be said to be rather shallow, as it contains minimal information, and neither the host nor the router has any information about the internal state of the other device.
• Surface 2, between the two routers, is actually two surfaces. The first is the information carried inside the packets being forwarded along the path, and it has the same characteristics as the surface between Host A and Router B. The second is the routing protocol (or other distributed mechanism designed to carry reachability and topology information through the network). This second surface between Routers B and C has more depth, as each device is actually trading some level of information about internal state, and more breadth, as this surface actually reaches across every forwarding device participating in the control plane within the failure domain.
• Surface 3, between Router C and Server D, is similar to Surface 1.
• Surfaces 4 and 7 don’t (normally) exist in a network primarily dependent on a distributed control plane.
• Surfaces 5 and 6, in a network with a primarily distributed control plane, represent a management interface. While this surface is both deep, because it carries internal device state, and broad, because the management device typically interacts with most or all of the devices connected to the network, this surface doesn’t reach directly into the forwarding information in any meaningful way. From the forwarding perspective, this surface is actually shallow. It is, in fact, shallow enough that it can often go unmonitored with little impact on actual network operation (although this wouldn’t be recommended!).
When considering a programmable network, however:
• Surfaces 1 and 3 remain the same.
• Surface 2 can either remain the same or have a reduction in complexity. If the control plane is completely centralized, rather than running in a hybrid mode (the distributed control plane operates alongside a centralized system that modifies forwarding information based on policy requirements), the only surface interaction between Routers B and C would be at the packet level, much like Surface 1.
• Surfaces 4 and 7 may contain a good deal of state in near real time, such as how much information the application needs to transfer, quality of service requirements, how fast the application should transfer information based on the network’s state, etc. These two surfaces, then, can become much deeper and broader.
• Surfaces 5 and 6 now contain actual forwarding state, both as it exists and how it should be, being carried between the controller and the forwarding devices. These two surfaces will become much deeper, as the internal forwarding state of each device is exposed and modified by the controller.
• If the controller is deployed in a hybrid mode, where a distributed control plane discovers reachability and topology in near real time, and a centralized system overlays the control plane to implement policy, there is also a new interaction surface between the two control planes. The centralized control plane must be notified of changes in reachability or topology in near real time, and then calculate and implement policy changes as needed to preserve the policies imposed on the network. This interaction surface is both deep and broad.
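One way to summarize this shift is to assign each surface a rough depth and breadth score and compare the two designs. The following sketch is purely illustrative; the `Surface` class, the numeric ratings, and the complexity proxy are assumptions made for the example, not part of any standard model:

```python
from dataclasses import dataclass

@dataclass
class Surface:
    name: str
    depth: int    # 0 = shallow (packet headers only), 2 = deep (internal state exposed)
    breadth: int  # 0 = narrow (two devices), 2 = broad (most devices in the domain)

# Distributed control plane: the management surfaces are broad, but
# shallow with respect to forwarding state.
distributed = [
    Surface("1: host-router packets", depth=0, breadth=0),
    Surface("2: routing protocol", depth=1, breadth=2),
    Surface("5/6: management", depth=1, breadth=2),
]

# Programmable network: controller surfaces deepen because forwarding
# state is exposed and modified directly, and app-controller surfaces appear.
programmable = [
    Surface("1: host-router packets", depth=0, breadth=0),
    Surface("4/7: app-controller", depth=2, breadth=2),
    Surface("5/6: controller-device", depth=2, breadth=2),
]

def total_complexity(surfaces):
    """A crude proxy: sum of depth x breadth across all surfaces."""
    return sum(s.depth * s.breadth for s in surfaces)

print(total_complexity(distributed), total_complexity(programmable))
```

Even with these rough numbers, the point stands: centralization does not remove interaction surfaces, it deepens and broadens them.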
An interesting problem in the surface states is the newly minted feedback loops, for instance along surfaces 3, 6, and 7. The network has always been capable of impacting application state. Once the application is capable of impacting network state, however, a feedback loop forms. While such feedback loops can have a strong positive impact, particularly in the form of injecting applications requirements into the control plane’s operation in near real time, and in allowing applications to “see” network conditions in near real time, such feedback loops can be dangerous as well. This is an area the network designer deploying programmable networks must pay careful attention to if stability is a goal.
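The danger can be made concrete with a toy control loop. The model and the gain values below are illustrative assumptions, not a description of any real protocol: an application that adjusts its sending rate based on the spare capacity the network reports back converges when the coupling is loose, and overcorrects endlessly when it is tight.

```python
def simulate(gain, capacity=100.0, rate=10.0, steps=50):
    """The application raises or lowers its sending rate in proportion
    to the spare capacity the network reports back. A higher gain means
    tighter coupling between application and network state."""
    rates = []
    for _ in range(steps):
        spare = capacity - rate               # network state, as seen by the app
        rate = max(0.0, rate + gain * spare)  # the application reacts
        rates.append(rate)
    return rates

stable = simulate(gain=0.5)    # loosely coupled: converges to capacity
unstable = simulate(gain=2.5)  # tightly coupled: overcorrects every step

print(round(stable[-1], 2))                      # settles at the link capacity
print(min(unstable[-10:]), max(unstable[-10:]))  # swings between extremes
```

The stable loop settles at capacity; the unstable one never does. The same dynamic can appear between a controller and the applications feeding requirements into it.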
Note
Chapter 8, “How Complex Systems Fail,” discussed the various failure modes of such feedback loops.
Failure domain separation through loose coupling of different systems (and multiple pieces within these systems) is a foundational concept in large-scale network design. Does the programmable network have an impact in this area? Figure 10.4 illustrates failure domains and programmable networks.
Note
Controller B actually has a connection with each of the switches shown in the spine and leaf fabric; for clarity, these are shown as a group, rather than individual connections.
Four specific instances of failure domain impact need to be examined here.
In the distributed design on the left side of the diagram, internal reachability and topology information is carried by an IGP underlay, while external (or edge) reachability is carried in BGP as an overlay. This separates edge (or external) reachability from internal reachability, providing two different failure domains with fairly loose coupling between the two. The IGP can further be broken up into multiple failure domains along topological boundaries by hiding reachability or topology information, or by configuring other forms of information hiding.
When a controller is deployed to either replace or augment the distributed control plane, these various failure domains are replaced with a single failure domain. Whether this is a positive development or not depends on the ability of the controller to handle faults, the amount of policy being managed, and other factors—but the reduction in the number of failure domains goes against best common practice built on years of experience with network design, and needs to be seriously considered as a tradeoff.
What if there are two controllers, rather than one? Two controllers managing the same set of devices must be synchronized in terms of state through some mechanism—hence they will actually form a single failure domain in response to many common errors. For instance, if a malformed packet carrying control information causes the first controller to fail, it will likely cause the second to fail, as well.
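A minimal sketch of why state-synchronized replicas share a fate follows; the classes and the particular bug are hypothetical, chosen only to show the mechanism:

```python
class Controller:
    """One replica of a centralized control plane."""
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.routes = {}

    def process(self, update):
        # Every replica runs the same code against the same state, so a
        # malformed update triggers the same failure in each of them.
        if "prefix" not in update or "next_hop" not in update:
            self.alive = False  # the replica crashes
            return
        self.routes[update["prefix"]] = update["next_hop"]

def replicate(update, controllers):
    """State synchronization: every replica sees every update."""
    for c in controllers:
        if c.alive:
            c.process(update)

pair = [Controller("A"), Controller("B")]
replicate({"prefix": "10.0.0.0/8", "next_hop": "192.0.2.1"}, pair)
replicate({"next_hp": "192.0.2.9"}, pair)  # malformed update: mistyped key
print([c.alive for c in pair])             # both replicas fail together
```

Redundancy protects against hardware loss, not against correlated software faults: the synchronization channel that keeps the replicas consistent is exactly what welds them into one failure domain.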
Note
It is possible to have a more federated controller design, where each controller can send instructions to several forwarding devices, and some mechanism is used on each device to determine which controller’s input should be accepted. This type of mechanism would represent another form of information hiding, as the decision of which state to accept, and which policies should be accepted from each controller, is a distributed form of state. This state must, of course, still be configured or managed in some way—the complexity doesn’t “go away,” it just moves from one area of the network to another.
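The federated design can be sketched as a per-device arbitration function. The controller names and the preference scheme here are invented for illustration; the point is that the decision of which controller wins is itself state that lives on each device:

```python
# Hypothetical per-device arbitration: each forwarding device ranks its
# controllers and accepts an instruction for a prefix only from the
# most-preferred controller that has offered one.
PREFERENCE = {"underlay-ctrl": 1, "overlay-ctrl": 2}  # lower = preferred

def accept(device_fib, instruction):
    """instruction = (controller, prefix, next_hop). The arbitration
    policy is distributed state, configured on every device."""
    ctrl, prefix, next_hop = instruction
    current = device_fib.get(prefix)
    if current is None or PREFERENCE[ctrl] <= PREFERENCE[current[0]]:
        device_fib[prefix] = (ctrl, next_hop)

fib = {}
accept(fib, ("overlay-ctrl", "10.1.0.0/16", "tunnel0"))
accept(fib, ("underlay-ctrl", "10.1.0.0/16", "eth1"))    # preferred: wins
accept(fib, ("overlay-ctrl", "10.1.0.0/16", "tunnel1"))  # less preferred: ignored
print(fib["10.1.0.0/16"])
```

The complexity has not gone away; the `PREFERENCE` table must itself be configured, distributed, and kept consistent across every device.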
The distributed protocol design as presented in the illustration has each leaf row and the spine row shown as a separate BGP autonomous system—a fairly common design in very large-scale data centers (especially supporting commercial cloud-based workloads). Because network devices within any given row (either leaf or spine) are not interconnected, each BGP speaker only maintains eBGP connections. Failure domains must be seen in two ways in this type of design:
• eBGP is known for its loose coupling and is often used to connect two different failure domains. In this design, then, each router is effectively a separate “failure domain” from a control plane perspective, as counter-intuitive as that might seem.
• The entire BGP routing system is, however, a single failure domain.
An alternative might be a simple IP or IP/MPLS underlay with a separate BGP, VXLAN, or other overlay. Replacing either of these with a single controller, or a set of redundant controllers, once again turns what is originally a set of loosely coupled systems into one system, reducing the number of failure domains.
Note
Use of BGP for Routing in Large-scale Data Centers (draft-ietf-rtgwg-bgp-routing-large-dc) describes using BGP as a standalone protocol in large-scale data center fabrics in detail.
One option here is to use one controller for the underlay, and another for the overlay, breaking them into separate failure domains, to recover some of the separation in the original design.
In a network that is not programmable, there is little interaction between applications running on the network and the control plane—hence the application and the control plane represent either completely decoupled or loosely coupled failure domains (with the primary coupling being in the area of packet headers carrying information between the application and the network devices). Putting an interaction surface between these two systems actually creates a larger failure domain, as there are failure modes that can reach across the controller/application divide.
Further, presumably most of the applications running on the network will have such an interface, providing yet another channel through which many complex systems can interact in unforeseen ways—hence the applications themselves can be unintentionally coupled into a single failure domain through the controller. As an example, two different applications might “fight” over the quality of service settings for a particular set of links, eventually causing a chain of events that causes one of the two applications (or both) to fail.
The connection between controllers A and B must be watched carefully to ensure loose coupling is maintained between them. If these two controllers unintentionally form a tightly coupled pair, the entire network becomes a single failure domain—contrary to all good design principles.
A final area to be careful of is in-band signaling, which tightly couples access to the forwarding devices to the state of the forwarding devices themselves. If building for centralized control, consider out-of-band control channels to create a truly stable network, and remember to consider the complexities of building and maintaining the out-of-band network itself.
In short, it’s important to consider the tradeoffs involved in enlarging failure domains through more centralized control of the network. Take care to examine each interaction surface carefully, particularly looking for potential instances of tight coupling, and think through ways of enforcing loose coupling in these areas. One good way, if the entire distributed control plane is being replaced, is to run multiple controllers, each one providing reachability for either one specific topological area of the network or one virtual topology. Connect the controllers with the loosest coupling possible—traditional BGP could be a good choice.
Programmable networks are surely in the future of every network engineer; the complexity of distributing and managing policy across thousands of devices quickly becomes unmanageable in any large-scale environment. The rush to centralize, though, needs to be balanced with serious consideration of the complexity tradeoffs.
There is no silver bullet for complexity.
Traditional network design holds a lot of lessons for the new world of programmable networks, from separating failure domains to thinking through how and where to distribute policy to potential feedback loops. One useful way to think through these problems is to consider a layered model for the control plane, as illustrated in Figure 10.5.
In this model, there are four essential tasks a control plane handles:
• Discovering the topology and reachability
• Determining the shortest path between every pair of reachable nodes attached to the network
• Hiding information or building failure domains
• Traffic engineering
Each of these four functions is repeated at each “layer” of the network. As a simple example, consider an underlay/overlay:
• There will be the shortest path and topology discovery in the underlay protocol.
• Information hiding and traffic engineering will be implemented in the underlay, as well, through aggregation, flooding domain boundaries, (potentially) fast reroute, and other mechanisms.
• There will be the shortest path and topology discovery in the overlay control plane that is separate from the underlay.
• Information hiding and traffic engineering will also be implemented in the overlay—service chaining, examined in the next chapter, might be one example.
By breaking the control plane out into functions rather than protocols, engineers can get a better feel for what each protocol in a network control plane should do and build layers of functionality that will provide an optimal balance between functionality and complexity. This is similar to the way network engineers have built layers into protocol stacks and applications to manage complexity and scale problems.
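The four tasks can be expressed as an interface each layer implements. The sketch below is a structural illustration only; the class names and the particular mechanisms listed for each layer are assumptions drawn from the underlay/overlay example above:

```python
from abc import ABC, abstractmethod

class ControlPlaneLayer(ABC):
    """The four essential control plane tasks, repeated at each layer."""
    @abstractmethod
    def discover(self): ...          # topology and reachability
    @abstractmethod
    def shortest_paths(self): ...    # path computation between reachable nodes
    @abstractmethod
    def hide_information(self): ...  # aggregation / failure domain boundaries
    @abstractmethod
    def engineer_traffic(self): ...  # traffic engineering

class Underlay(ControlPlaneLayer):
    def discover(self): return "IGP flooding"
    def shortest_paths(self): return "SPF over the physical topology"
    def hide_information(self): return "aggregation, flooding domain boundaries"
    def engineer_traffic(self): return "fast reroute, metrics"

class Overlay(ControlPlaneLayer):
    def discover(self): return "BGP / controller reachability"
    def shortest_paths(self): return "paths over the virtual topology"
    def hide_information(self): return "per-tenant topologies"
    def engineer_traffic(self): return "service chaining"

for layer in (Underlay(), Overlay()):
    print(type(layer).__name__, "->", layer.hide_information())
```

Asking "which layer implements this function, and with which protocol?" is often a faster route to a clean design than starting from a protocol list.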
In the early days of network engineering, two words could be used to describe the applications running over the network: simple and few. There were several different ways to transfer files, a few ways to talk to someone else (such as email and message boards), and perhaps a few other applications of note. In these ideal times, the end-to-end principle reigned supreme, with hosts only talking to other hosts (or bigger hosts called servers) and the occasional mainframe or mini (remember those?).
This has changed over the years; each new generation of users and businesses have added new ideas about what networks need to support and laid requirements on the network itself.
First, there were firewalls, which protect data and systems from attackers. Then the network monitoring tools, and the deep packet inspectors in the form of intrusion detection systems. To save ongoing operational expenses on long haul links, wide area accelerators were added to the path. A single server couldn’t handle the load thrown at it, so load balancers were inserted. Over time, to account for the increasing scarcity of IPv4 address space, network address translators were pushed into the path. Appliances running a set of services were installed throughout the network, effectively moving intelligence out of the end hosts and into the network itself—bumps in the wire the host didn’t know about now manipulate, fold, and spindle almost every packet transmitted anywhere beyond a single hop. Applications now run “on” the network, in the sense that the APIs that connect the different pieces of the network together are actually resolved across the network itself. Three different things cause this model of adding appliances to add services to break down:
• The added complexity of policy distribution throughout the network. Chapter 4, “Operational Complexity,” has a more lengthy discussion on policy dispersion versus complexity, but essentially policy spread throughout the network is much more difficult to manage than policy concentrated in a few processes or places.
• The added cost of installing and managing appliances. Each appliance not only represents a cost, but also represents power, space, cabling, and an entire lifecycle that must be managed. Each of these things adds operational and/or capital costs.
• The virtualization of the network. Supporting multiple tenants, deeper segmentation for security, and more efficient use of the network requires virtualization. Pushing virtualized circuits through an appliance, rather than physical ones, is much more difficult to deploy and manage for several reasons. For instance, a virtual topology can be moved almost anywhere in the network without regard to topology (in fact, in spite of topology). Moving an appliance-based service from one location to another, or even just the state contained in the appliance from one location to another, is problematic (if not impossible).
So why not virtualize the services as well? This is where Network Function Virtualization (NFV) enters the picture. This chapter will discuss several aspects of service virtualization, including some of the specific cases where services are being virtualized, and the concept of service chaining, which is required to steer traffic to these virtualized services (services that were previously placed directly into the packet stream as appliances). The topic of complexity isn’t addressed directly here; rather, it is reserved for the next chapter, which explicitly deals with the tradeoffs in complexity around the virtualization of services.
In 1994, a group of engineers formed the company Network Translation and designed the first PIX firewall. The security features were evident from the start, but considered secondary—the PIX was first developed to do Network Address Translation (NAT). In the face of discussions around Address Allocation for Private Internets1 and Traditional IP Network Address Translator,2 the IPv4 address space was showing the first signs of running out. Cisco acquired the PIX in late 1995, shipping and maintaining the product with new revisions until 2008.
1. Y. Rekhter et al., “Address Allocation for Private Internets” (IETF, February 1996), accessed September 24, 2015, https://datatracker.ietf.org/doc/rfc1918/.
2. P. Srisuresh and K. Egevang, “Traditional IP Network Address Translator” (IETF, n.d.), accessed September 24, 2015, https://www.rfc-editor.org/rfc/rfc3022.txt.
Engineers in the Cisco Technical Assistance Center, being curious folks (as most engineers are), quickly pulled the cover off the first few PIX devices shipped to their local labs, and discovered an Intel processor (for most models) and standard Ethernet chipsets.
Note
The PIX was constructed using Intel-based/Intel-compatible motherboards; the PIX 501 used an AMD 5x86 processor, and all other standalone models used Intel 80486 through Pentium III processors. Nearly all PIXes used Ethernet NICs with Intel 82557, 82558, and 82559 network controllers, but some older models are occasionally found with 3COM 3c590 and 3c595 Ethernet cards, Olicom-based Token-Ring cards, and Interphase-based Fiber Distributed Data Interface (FDDI) cards.3
3. “Cisco PIX—Wikipedia, the Free Encyclopedia,” n.p., accessed April 30, 2015, https://en.wikipedia.org/wiki/Cisco_PIX.
There was also a single piece of custom silicon, the PIX-PL in the original models and the PIX-PL2 in later models. The PIX-PL accelerated encryption because the Intel processors used for general processing couldn’t switch packets fast enough. These PIX-PL chips [and other hardware-acceleration Application-Specific Integrated Circuits (ASICs)] are, in a sense, at the heart of the NFV narrative. As with the first two generations of PIX, the most convincing reason to deploy an appliance in the middle of a traffic flow (as a middle box) is for the custom ASICs designed into the box.
Note
A number of vendors sell appliances rather than installable software services for reasons other than included custom ASICs: to control the performance of the service, to provide for simple licensing, to include a higher-end processor for functions such as cryptography than is normally available on network equipment, or simply to increase incremental revenue.
The question is: When can the general purpose processor already included in the appliance take over all the packet processing duties, and hence when can the appliance be replaced by a more generic/general purpose device? This point was reached sometime after 2008, but before 2015 (the year of this writing)—though there’s no “generally agreed on date” for the transition, for most network applications.
Note
The crucial words in the statement above are for most network applications. Networks are, in a sense, simply a “concession to human impatience.” There are only a few things that can be done by a high-speed link running across an ocean that can’t (in theory) be done using a standard carrier’s overnight box holding a large number of solid-state drives. But humans are infinitely impatient, so there will always be some things for which the speed of a general purpose processor won’t be fast enough, so appliances and custom ASICs most likely have a long life ahead of them. The more likely result is that the number of such devices will remain fairly steady over time, but as networks increase in size and scope, custom silicon switching hardware will decrease as a percentage of the hardware sold and deployed. This is all a matter of tradeoffs. If you take only one thing away from this book, it should be this: there are always tradeoffs. If you don’t see any tradeoffs, then you’re not looking hard enough; you’ve either missed something crucial about the environment, the problem, or the solution.
Once this trend of fast general purpose processors meets the virtualization of networks and applications, a question becomes obvious: Why should these applications run on appliances? In fact, deep packet inspection, stateful filtering, load balancing, and all the other services that live within the network can run on “fast enough” general purpose processors, and can therefore be run on the general purpose compute and storage resources already residing in the network.
NFV, then, solves two sets of problems—or rather, is at the confluence of two different trends:
• The explosion of virtualization in the network, and the desire to provide “in the network” services to the resulting virtual networks.
• Replacing expensive physical appliances containing custom packet switching hardware with less expensive (and better understood) general purpose computing resources.
To make all this a little more clear, it’s useful to work through a use case showing how these two trends merge into a single concept; the network illustrated in Figure 11.1 will be used as an example.
Figure 11.1 shows the network connectivity and services commonly required to provide email services to a large number of hosts. The path of traffic through the network can be traced as:
1. A host attached to either Router A or B sends an email packet toward one of the two servers, shown on the right side of the illustration
2. First, the traffic must pass through one of the routers, to be ingressed into the network, have any QoS markings imposed, terminate any tunnels (such as an IPsec SA or an MPLS VRF), and any basic filtering performed (such as source-based spoofing filters).
3. Then the traffic is passed to a firewall appliance, where several actions may be taken. For instance, passing the flow through a stateful packet filter that validates each packet against an existing stream, or performing deep packet inspection to check for the presence of an attack or malware in the flow’s contents. A common operation, at least with IPv4, is NAT (or, more likely Port Address Translation).
4. Once the traffic is through the first appliance, it is passed to a router, which then directs the traffic into the part of the provider network where the email servers are located. The next hop shown in the diagram is a load balancer.
5. In this particular situation, a single server cannot support the load of all the clients, so the network operator has configured a number of servers, each with access to the same database backend. Each of the mail servers is, then, operationally identical, with access to the same set of mail stores. The load balancer, another appliance, determines which server has the least amount of load, and hence should handle any new incoming requests. This service requires the load balancer to maintain state about each connection and each server.
6. The traffic is passed from the load balancer to the mail server, but along the way must pass through a Spam Filter (marked with SF in the diagram). This might be an appliance, or it might be a process or application running on the mail server itself.
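The per-connection and per-server state the load balancer in step 5 must hold can be sketched in a few lines. This is a minimal least-connections scheme with hypothetical server names; real load balancers track far more (health, weights, persistence):

```python
class LeastConnectionsBalancer:
    """Tracks active connections per server and sends each new
    request to the least-loaded server: the state the appliance holds."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}  # per-server load
        self.flows = {}                        # per-connection state: flow -> server

    def new_connection(self, flow_id):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        self.flows[flow_id] = server
        return server

    def close_connection(self, flow_id):
        server = self.flows.pop(flow_id)
        self.active[server] -= 1

lb = LeastConnectionsBalancer(["mail1", "mail2"])
a = lb.new_connection("flow-a")  # goes to mail1 (tie broken by order)
b = lb.new_connection("flow-b")  # mail2 is now least loaded
lb.close_connection("flow-a")
c = lb.new_connection("flow-c")  # back to mail1, now least loaded again
print(a, b, c)
```

Because this state lives inside the appliance, moving or failing over the load balancer means moving or losing the state with it, which is exactly the difficulty the next section turns to.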
In each case, the service is actually an application running on an appliance; the appliance must be placed in the path of the traffic as it moves between the host and the server, or between the two endpoints in the network. Putting an appliance into the network, and wiring the network such that traffic of a certain type must pass through it, can be called manual service insertion. A useful way to think about this type of service insertion in the network design is bringing the service to the traffic flow.
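Manual service insertion can be thought of as a fixed pipeline determined by the physical wiring: every packet of a given type traverses every stage in order. The function names and checks below are illustrative stand-ins for the appliances in the figure:

```python
# Each stage stands in for one wired-in appliance along the path.
def ingress_filter(pkt):
    if pkt["src"].startswith("10."):  # crude source-based spoofing check
        return pkt
    return None                       # drop the packet

def stateful_firewall(pkt):
    pkt["inspected"] = True           # stateful inspection elided
    return pkt

def load_balance(pkt):
    pkt["server"] = "mail1"           # server selection logic elided
    return pkt

def spam_filter(pkt):
    pkt["spam_score"] = 0.1           # content scanning elided
    return pkt

SERVICE_CHAIN = [ingress_filter, stateful_firewall, load_balance, spam_filter]

def forward(pkt):
    """The wiring: every packet passes through every stage, in order."""
    for service in SERVICE_CHAIN:
        pkt = service(pkt)
        if pkt is None:               # dropped somewhere mid-chain
            return None
    return pkt

result = forward({"src": "10.2.2.2", "dst": "mail.example.com"})
print(result["server"], result["inspected"])
```

In the physical network the ordering of `SERVICE_CHAIN` is set by cables and topology, which is why changing it, or bypassing a stage for some traffic, is so expensive.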
What if these services could be virtualized, and run in fairly standard compute and storage resources? One option would be to place standard compute and storage resources throughout the network, and create instances where needed to intercept traffic flows as they pass through the network. The primary operational expense this would save is in the cost of the appliances themselves—perhaps there is some savings here, but this model still largely depends on a scale up, rather than a scale out, model of providing services.
Instead of using a scale up model, inserting these services onto generic hardware scattered throughout the network, why not move to a scale out model? This is precisely the idea behind NFV; Figure 11.3 illustrates virtualizing functions in the network.
In Figure 11.3, each of the component services normally performed in an appliance has been made into a separate process, and is being run on generic compute and storage attached to the data center fabric. Rather than having appliances configured inline that perform stateful packet inspection, for instance, a pool of processing elements that perform the same function are simply attached to the data center fabric. In the same way, NAT, SES, and the mail servers themselves are all converted to processes running on generic compute and storage resources attached to the data center fabric.
The range of services that can be virtualized and placed on the data center fabric, or “in the cloud,” in more common parlance, is wide. For instance:
• The termination of services for mobile devices, including tunnels, operational and business services, billing, and the full range of support for these devices can be virtualized into a data center fabric.
• Content replication and duplication can often be more efficiently handled in virtualized scale out processes, rather than in customized appliances.
The problem with scaling out network services in this way should be obvious at this point. With the scale up model, the network engineer could place appliances into the path of the traffic as it passed from host to server, or from host to host. In the scale out NFV model, traffic somehow still needs to pass through these same services, but there’s no obvious way to enforce this sort of traffic flow through a packet-based network with simple destination-based forwarding.
Service chaining provides a solution to this problem.
Service chaining relies on modifying the path of a flow through a network overlay—or along a tunneled path in a virtual topology—to channel traffic through the set of services required to ensure the network operator’s policies are enforced. Figure 11.4 returns to the previous example, showing how a service chain can be used to push the traffic along a path required to ensure traffic passes through all the correct services in the right order.
In this case, traffic arriving from the host is passed through Router A, and placed in some sort of tunnel with an endpoint so it arrives at one of the packet inspection processes within that pool. Once each packet has been processed, it is encapsulated in a way that causes the traffic to be carried to a NAT process and then on to a process that does spam filtering, and finally to the mail server itself. At each of these steps, the packet is not forwarded based on the destination address—which would take the packet directly through the fabric to the mail server, its final destination—but rather to the next service in the chain.
How is this accomplished? There are three basic models that can be used to form a service chain:
• The initial service can be imposed by the ingress device on the chain, which is Router A in this case. Once the packet arrives at the first service in the chain, the first service (or the local hypervisor virtual switch) imposes the correct encapsulation onto the packet to carry it to the next service. The next service to impose on the chain is determined by some local policy within the service itself. Once the final service has handled the packet, it simply forwards each packet based on the destination address.
• The initial service, and all subsequent services, can be imposed by the switching devices within the fabric. This is similar to the first option, above, but each segment in the chain is imposed by devices in the network (Top-of-Rack switches, for instance), rather than by the service processes themselves.
• The initial service, and all subsequent services, can be imposed by the first device the packet encounters—the Data Center (DC) edge switch in the case of a cloud deployment, for instance. In this case, the edge switch is given information about every service through which packets destined to any particular service must pass, and some way to build a chain of headers that can be “stacked” onto the packet that will carry each packet through this chain of services.
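The third model — the edge device "stacking" one header per service onto the packet — can be pictured with a deliberately simplified sketch. The service names and list representation below are invented for illustration; this is not a wire format.

```python
def impose_chain(packet, services):
    # The edge switch stacks one header per service; the first service
    # in the chain ends up outermost (index 0).
    return list(services) + [packet]

def next_hop(stacked):
    # Each hop reads the outer header, then strips it once its service
    # has handled the packet.
    return stacked[0], stacked[1:]

stacked = impose_chain("ip-packet", ["inspect", "nat", "spam-filter"])
hop, stacked = next_hop(stacked)   # first hop: the inspection service
hop, stacked = next_hop(stacked)   # then NAT
hop, stacked = next_hop(stacked)   # then the spam filter
```

Once the last header is stripped, only the original packet remains, and it is forwarded normally based on its destination address — matching the end of each model described above.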
Examples of each of these types of deployments follow.
The IETF chartered the Service Function Chaining (SFC) working group in late 2013 to:
. . . document a new approach to service delivery and operation. It will produce an architecture for service function chaining that includes the necessary protocols or protocol extensions to convey the Service Function Chain and Service Function Path information to nodes that are involved in the implementation of service functions and Service Function Chains, as well as mechanisms for steering traffic through service functions.4
4. “Service Function Chaining Charter,” n.p., accessed May 9, 2015, https://datatracker.ietf.org/wg/sfc/charter/.
The SFC working group does not define encapsulations in the traditional sense—the SFC header does not contain any information about the source or destination of a packet, so it cannot be used for forwarding. Rather the SFC header contains a chain of services that can be acted on by either the services within the chain or by devices along the path that know how to unpack and interpret this header.
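One way to picture this is a header that carries only a path identifier and an index, which each participating hop decrements as the packet progresses. The field names below are loosely inspired by the SFC work but are purely illustrative, not the actual header layout.

```python
from dataclasses import dataclass

@dataclass
class SfcHeader:
    service_path_id: int   # which chain of services this packet follows
    service_index: int     # how many services remain in the chain

# Locally configured mapping from path position to service. The header
# itself names no source or destination, so it cannot be used for
# ordinary destination-based forwarding.
CHAIN_10 = ["inspect", "nat", "spam-filter"]

def current_service(hdr):
    # Devices (or services) resolve (path, index) to the next service.
    return CHAIN_10[len(CHAIN_10) - hdr.service_index]

hdr = SfcHeader(service_path_id=10, service_index=len(CHAIN_10))
visited = []
while hdr.service_index > 0:
    visited.append(current_service(hdr))
    hdr.service_index -= 1   # each service marks its step complete
```

The point of the sketch is the division of labor: the header carries only the chain state, while each hop supplies the knowledge of how to reach the service the state names.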
Figure 11.5 illustrates the architecture of SFC.
The SFC specifications do not assume any underlying tunnel mechanism, but the work has largely taken place with the base assumption that either VXLAN (documented in Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks5) or NVGRE (documented in Network Virtualization Using Generic Routing Encapsulation, draft-sridharan-virtualization-nvgre) will be the “common base,” or perhaps default, tunneling protocol in use. Figure 11.6 illustrates what is considered a normal process for chaining packets through an SFC-enabled overlay using one of these tunneling protocols.
5. M. Mahalingam et al., “Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks” (IETF, August 2014), accessed September 24, 2015, https://www.rfc-editor.org/rfc/rfc7348.txt.
In this illustration:
1. A packet is received from some host by the DC edge router. This device examines the packet and, determining it is part of a flow destined to a mail server connected to the fabric, imposes a service chain header with the correct set of services through which this packet must be routed before it can be delivered to the mail server itself. Once the service chain is imposed, the DC edge router encapsulates the packet into a VXLAN, NVGRE, or other tunnel, and forwards the packet toward the first hop inside the fabric, the stateful packet inspection service.
2. The packet is received by the virtual switch (VSwitch) running in the hypervisor of the blade server on which the packet inspection service processes or containers are configured. The network administrator has configured the service chain to terminate in the VSwitch rather than the stateful packet inspection processes themselves because these processes are not SFC aware—so the VSwitch is acting as a “proxy” for these services along the service chain.
3. The VSwitch de-encapsulates the packet, presenting it to one of the stateful packet inspection processes in a format the process will understand. Once the packet has been processed, the packet is switched toward the mail server, which directs it back through the VSwitch.
4. The VSwitch will now reimpose the correct SFC header, place the packet into the correct tunnel to reach the next service in the chain (in this case, a NAT service), and forward the packet back out over the DC fabric.
5. The NAT service, being SFC capable, receives the packet and processes it as needed. The NAT service uses local routing information to build the correct tunnel encapsulation to carry the packet to the next service along the chain. In this case, the next service is the email filtering service, which also happens to be SFC aware.
6. The SES service, having processed the packet, notes that the service chain has ended, and the packet may be forwarded on to its final destination—the mail server itself. The SES service removes the SFC headers and forwards the remaining IP packet through the VSwitch toward the ToR switch it is physically connected to.
7. The ToR receives the packet, examines its local routing information, and forwards the packet over the DC fabric to the mail server.
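Steps 2 through 4 — the VSwitch acting as a proxy for a service that is not SFC aware — can be rendered as a toy function. All names here are invented for illustration; the point is only the strip/process/reimpose pattern.

```python
def sfc_proxy(packet_with_chain, service_fn):
    # The proxy splits off the chain header the service cannot parse.
    chain, packet = packet_with_chain
    processed = service_fn(packet)   # the service sees a plain packet
    remaining = chain[1:]            # this hop of the chain is complete
    # Reimpose the remaining chain before re-entering the fabric.
    return (remaining, processed)

def inspect(packet):
    # A service that is not SFC aware; it only understands raw packets.
    return packet + "+inspected"

msg = (["inspect", "nat", "spam-filter"], "ip-packet")
msg = sfc_proxy(msg, inspect)
```

An SFC-aware service, by contrast, would receive the chain header intact and perform the strip-and-reimpose itself, as the NAT and SES services do in steps 5 and 6.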
SFC is, then, primarily an example of the first (and potentially the second) kind of service chaining outlined above—the services, themselves, control the forwarding of the packet through the data center (or cloud) fabric by taking advantage of tunneling mechanisms that already exist within the overlay (or virtual) network.
Segment routing is a product of the Source Packet Routing in the Network (SPRING) working group of the IETF. The basic concept is the same as SFC, but rather than assuming the processes along the path will manage the service chain, SPRING assumes the routing control plane will actually manage the path of flows through a network. SPRING can use one of two modes of operation, as illustrated in Figure 11.7.
To understand the flow of traffic in this illustration, it’s important to start with one basic definition. A segment actually represents a network segment (like an Ethernet segment, though it might not actually be Ethernet) or an adjacency between two devices. In practice, a segment is normally assigned to a routed interface much like a destination IP address, or perhaps the inbound interface of the device that knows how to reach the service or the segment where the service lives. This might seem a little confusing, but the point is to get the traffic to the service (or collection of services) that lives on a specific segment. While SFC routes to specific services, SPRING routes to specific segments. This makes sense from the perspective of segment routing, as it is control plane focused, rather than specifically service focused. In the real world, there might not be much of a difference between the concept of a service and a segment, but understanding the terminology is important when trying to understand the technology.
In Figure 11.7:
Note
This description focuses on using MPLS to transport a packet along a segment route through a network. SPRING actually defaults to using a loose source route header option in IPv6, but it’s simpler to illustrate using an MPLS label stack, so that’s what is used here.
1. Traffic is received by the ingress router. The destination address is examined, or some form of packet inspection takes place to determine the type of information the packet is carrying (such as HTTPS or SSH), and thus into which service chain the packet needs to be inserted. Once this is determined, an MPLS label stack, called the SR tunnel, is imposed onto the packet. This stack encodes the list of segments that the packet must visit before being released from the tunnel at the tail end.
2. The packet is forwarded based on the outer MPLS label, along the LSP for label 1000, where it reaches the first router in the chain. Router B has been configured to swap label 1000 for 1001, and forward the packet along to one of the packet inspection processes in the pool. This is called a continue in segment routing; the current outer label is replaced, rather than popped, before the packet is forwarded.
3. Router B forwards the packet based on the new outer MPLS label, this time to one of the packet inspection processes. Just like SFC, a segment can actually be an anycast destination in SPRING. This allows the packet to be forwarded to one of a set of segments, each of which has (at least) one instance of a particular service attached to it. The packet inspection service finishes processing the packet, and forwards it back to Router B.
4. Router B now pops the outer label, as this segment has been completed. This exposes the second MPLS label in the stack as the new outer label. Router B forwards the packet, based on this new outer label, through the LSP indicated by the label value, 1002.
5. The MPLS label 1002 happens to indicate an adjacency between a virtual switch connected to Router C and a NAT service. The NAT service is directly participating in the service chain, so it processes the packet and pops the outer label, exposing the next label in the stack, with a value of 1003. The NAT process forwards the packet out through Router C, which simply switches it based on the outer label toward the SES service next in the chain.
6. The SES service receives the packet, processes it, and pops the final label in the stack, exposing the IP header. At this point, the service forwards the packet toward the hypervisor, to be forwarded based on the original layer 3 forwarding information.
7. The hypervisor uses the IP destination address to forward the packet toward the final mail server, completing the journey of the packet through the service chain.
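The label operations in this walkthrough — a continue that swaps the outer label, and a pop when a segment completes — can be sketched as list operations using the label values from the example. The swap table and function names are invented; real MPLS forwarding obviously works on binary headers, not Python lists.

```python
def sr_continue(stack, swap_table):
    # "Continue": replace the outer label, keep the rest of the stack
    # intact (step 2 in the walkthrough).
    return [swap_table[stack[0]]] + stack[1:]

def sr_pop(stack):
    # Segment complete: pop the outer label, exposing the next label in
    # the stack (steps 4 through 6).
    return stack[1:]

stack = [1000, 1002, 1003]                 # SR tunnel imposed at ingress
stack = sr_continue(stack, {1000: 1001})   # Router B swaps 1000 for 1001
after_swap = list(stack)
stack = sr_pop(stack)   # inspection segment done
stack = sr_pop(stack)   # NAT segment (label 1002) done
stack = sr_pop(stack)   # SES segment (label 1003) done; IP header exposed
```

When the stack is empty, only the IP header remains, and the hypervisor forwards on the destination address exactly as in step 7.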
As segment routing can use MPLS LSPs as its underlying transport, it’s possible for the path of a flow through the network to be defined using something other than simply the list of services or segments that packets in the flow must visit. It is possible, for instance, to use a centralized controller to compute a path through the network constrained by available bandwidth, latency, or other factors, and then signal the path using PCEP.
Segment routing, then, provides a lot of flexibility to the network operator in defining a set of services or segments through which a particular flow must pass on its way through the network.
NFV, combined with some form of service chaining (whether SFC, SPRING, or some other option), provides a powerful set of tools for operators to organize their networks around services, rather than around traffic flows. Essentially, rather than bringing services to the traffic flow, these technologies allow the operator to bring the traffic to the services. This has always been possible through hacks with source- or policy-based routing, various forms of tunneling, and other options—but none of these are as architecturally clean as the options presented in this chapter.
With this background in place, however, it’s important to return to the concept of complexity. What are the complexity tradeoffs involved in these forms of “extreme traffic engineering”? The next chapter addresses this question.
Network services, such as stateful packet filtering, network address translation, and billing per unit of data transmitted, have traditionally been associated with appliances. If you wanted to block access to a particular application or part of the network, you would purchase an appliance called a firewall, mount it in a rack, and cable the network so traffic passing into the network would pass through the appliance. But why should packet inspection, for instance—ideally positioned as a software service to support rapid deployment of new features—be tied to an appliance, and hence to a specific physical location in the network? And why should a specific set of services, no matter how logically grouped, be associated with one specific physical appliance? Instead of a bundle of related services tied to a physical appliance mounted in a rack, it should be possible to run individual software-based services (perhaps even microservices) on generic hardware located anywhere in the network. This would allow the operator to scale services out as needed, and manage the lifecycle of services from a software, rather than hardware, point of view.
Virtualizing services in this way, however, presents a secondary problem. If the service is virtualized, and can therefore be placed anywhere in the network, how can traffic be directed through it? When connecting physical appliances, traffic is directed through the service by cabling the appliance into the network. Obviously some sort of “virtual cabling” is needed. Tunneling and policy-based routing (or filter-based forwarding) are the obvious solutions to this sort of problem, but they add a lot of complex configurations that need to be managed. As services are brought online, or as policies change dynamically, the entire infrastructure of tunnels and/or policy needs to be tuned—a difficult, if not impossible, task.
Service chaining, a technique allowing the network operator to treat each service (or microservice) as a sort of building block within a larger service, can resolve these problems. Deploying a technology such as service chaining allows the architect to design the network around optimal usage of resources, rather than around the services through which traffic must flow while passing through the network.
Virtualizing services into generic hardware running anywhere on the network, and using service chaining to build a larger, customized application out of the virtualized services, seems like a win/win situation. By allowing the operator to disconnect the physical and logical locations of a service from the physical topology, service chaining and virtualized services allow a great deal of flexibility in building applications on top of the network, in scaling the network and services, and in disconnecting hardware lifecycles from service lifecycles. This appearance should immediately raise a red flag. If you haven’t found a tradeoff, then you haven’t looked hard enough.
This chapter is focused on exploring the tradeoffs in service virtualization. The first section considers the concept of policy dispersion and network virtualization, along with the direct relationship between policy and complexity in managing a large-scale system. The second section considers other complexity factors when deploying service virtualization, including failure domains (tight coupling) and troubleshooting (MTTR)—with a slight detour through some coding history as it relates to maintainability in the real world.
The third section considers the orchestration effect—how service virtualization allows the operator to treat the network as a set of services and processes, rather than as a topology and devices. This abstraction brings a great deal of power to the table, in terms of connecting to business drivers, but it also brings its own level of complexity into the design space. The fourth section of this chapter considers some practical advice on managing the complexity service virtualization and chaining introduce into network designs. The final section provides some final thoughts on service virtualization.
Chapter 4, “Operational Complexity,” touched on policy dispersion a bit. Figure 12.1 provides an example as a refresher on the concept, and to continue discussion around policy dispersion and service virtualization.
In this small network, there is a single policy—each packet must pass through a stateful packet inspection while passing between the originating host (on the left side of the diagram) and the destination server (on the right side of the diagram). Assuming that the stateful packet inspection service can be virtualized, or somehow placed anywhere in the network topology, where should it be inserted and configured? At point A, point B, or point C in the topology?
If the policy is configured at point A, it must be configured and managed on four different devices. Any change in the policy must be distributed to each of these four devices in a way that produces roughly equivalent service for each of the users impacted. Somehow this policy must not only be distributed consistently, but also within a short time period. Remember the CAP theorem from Chapter 1, “Defining Complexity”? Keeping the configurations of a large number of devices scattered throughout a network consistent is a direct application of this theorem—consistent, accessible, and partitionable: choose two. As the configuration is spread out over multiple devices, partitionable is chosen by default, so either accessible or consistent must suffer. Most often, consistent is the element of the triad that suffers; network operators often “give up” on managing the configuration of every device in the network in real time.
While configuring the policy at point A has the downside of increasing management strain, and the certain result that not every device is going to be configured the same in near real time, it has a compensating positive feature. Any packets that would otherwise be discarded because they fall outside the policies configured on the stateful packet filtering devices will be discarded early, which means network resources will not be wasted carrying packets that will ultimately be dropped to the filtering device. It also exposes a large part of the network to unfiltered traffic—there is some remote chance an attacker could use this “unfiltered space” through the network as an attack surface.
What about placing the stateful packet filtering at point C? This does reduce the number of devices that need to be managed to one—but it allows unfiltered traffic to pass through the network.
Placing the stateful packet filter at point C doesn’t just use the network less efficiently, however; it also means the device placed at point C must be able to support the same load and traffic flow that devices placed at point A would. In other words, the single device at point C must be able to scale to four times the size of the same devices positioned at point A in the network.
There are two problems with the current service model, then. First, the service must be brought to the traffic flow. While some degree of traffic engineering can be admitted, in the end all traffic must flow through some appliance or device that instantiates the service in order for the policy to be applied. Second, the service must be scaled in a way that accommodates its placement in the network, rather than in a way that accommodates its usage. Neither of these are ideal from an operational or business perspective.
What if instead of bringing the service to the traffic—placing the stateful packet filters in the network where the traffic will necessarily pass through them—the traffic is brought to the service? This would disconnect the location of the service in the network from the implementation of the policies the service represents, and it would allow the service to be scaled out, rather than up.
This is precisely what service chaining allows the operator to do. Service chaining lets you bring the traffic to the policy, so you can implement the policy in fewer places, across a smaller set of devices, preferably virtualized on a standard set of hardware. Figure 12.2 illustrates the same network after gathering the stateful packet inspection services into a single, scalable service running in one location in the network, with chaining used to pull packets through the service to the final destination.
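As a back-of-the-envelope comparison (the device names and counts are purely illustrative), dispersing the policy means every change touches each edge device, while chaining concentrates changes at one service pool:

```python
def touch_points(policy_locations):
    # Each policy change must be pushed to every location holding it.
    return len(policy_locations)

# Point A in Figure 12.1: the policy lives on four edge devices.
dispersed = ["edge-a1", "edge-a2", "edge-a3", "edge-a4"]
# Figure 12.2: the policy lives in one chained service pool.
chained = ["spi-pool"]

updates_dispersed = touch_points(dispersed)   # four devices per change
updates_chained = touch_points(chained)       # one service per change
```

The reduction in touch points is exactly what eases the CAP pressure described earlier: a single policy location is trivially consistent with itself.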
Examining a few different points along the state/speed/surface model will help to bring the various complexity tradeoffs into sharper focus.
The primary driver behind virtualizing the services is the reduction in the amount of policy dispersed throughout the network. Done correctly, service virtualization can move policy from a large number of devices (hundreds to hundreds of thousands—in the case of virtualizing the network edge for mobile devices, such as cell phones) into a single data center fabric. Once these services are moved, the policy that goes with them can be centrally managed through an automation system. This is a huge gain for the operator, because both across the board and individualized policies can be efficiently managed, lowering cost and complexity from a policy perspective.
From the perspective of the control plane, however, the amount of state in the network increases.
First, there must be some form of tunneling, and the associated control plane, in place to carry traffic that is transmitted to one destination through a chain of services instead (as described in Chapter 11, “Service Virtualization and Service Chaining”). Assuming that the virtualized services will reside in a data center, the tunneling and associated control plane information isn’t going to add a lot of additional complexity. Most data center networks are already built with an underlay that provides some form of simple switching through the fabric (often IP or IP/MPLS only), and then an overlay that provides a richer control plane for virtualized connectivity, so the necessary tunneling mechanisms will likely already be in place.
Second, the service chain must be imposed someplace: there must be some device that creates a tunnel header with information about each service, and any packet within a chained flow must be passed into the tunnel (or have the service chain header written into the packet) in some way. To accomplish this, the service chain itself must be carried through a control plane or management system to the edge of the data center network (generally the inbound fabric edge or gateway). This service chain information represents additional state carried through the control plane, and hence an increase in both the amount of control plane state and the complexity of the control plane.
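As a rough illustration of the mechanism, the imposed chain can be thought of as an ordered list of service identifiers carried in a tunnel header, with each service hop consuming the head of the list. The following Python sketch is purely illustrative; it is not tied to any particular encapsulation (such as NSH or an MPLS label stack), and all names are hypothetical.

```python
# Hypothetical sketch of service chain imposition at a fabric edge.
# The chain is modeled as an ordered list of service identifiers in a
# tunnel header; each service hop pops the head of the list.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Packet:
    dst: str
    chain: List[int] = field(default_factory=list)  # imposed service chain

def impose_chain(pkt: Packet, chain: List[int]) -> Packet:
    """Edge device: write the service chain into the tunnel header."""
    pkt.chain = list(chain)
    return pkt

def next_service(pkt: Packet) -> Optional[int]:
    """Each service hop: pop the next service ID; None means the chain
    is exhausted and the packet returns to normal forwarding."""
    return pkt.chain.pop(0) if pkt.chain else None

pkt = impose_chain(Packet(dst="2001:db8:0:1::/64"), [1000, 1001])
print(next_service(pkt))  # 1000
print(next_service(pkt))  # 1001
print(next_service(pkt))  # None -> normal forwarding
```

The point of the sketch is simply that the chain itself is state: it must be carried to the imposing edge device by some control plane or management system before the first packet arrives.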
In the area of state, then, virtualization offers a mixed bag. It reduces policy dispersion and decreases state from a policy perspective, but it increases state in the control plane, and increases network stretch. As with all things in network design, no single answer is the “right” answer for every network. Most networks will probably see a decrease in aggregate complexity through the virtualization of services, while a smaller percentage might see an increase.
Carrying a flow through a service chain will almost certainly increase network stretch. Comparing Figures 12.1 and 12.2 amply illustrates the increasing stretch as the service moves from residing on an inline appliance, to an offline service along the path, to, finally, a virtualized service connected to a data center fabric. While the data center fabric is (or should be) optimized for very low latency, low packet loss levels, low jitter, and all the “rest of the best” in quality of service, there are still several additional hops through the network. The flow must move through the data center edge (gateway); through three (or more, depending on the network topology) hops across the fabric to the Top-of-Rack (ToR) switch where the virtualized stateful packet filtering service is located; through the hypervisor and virtual switch on the compute side of the ToR, and back; through three more hops across the fabric; and finally to the server. Increasing the stretch in this way will increase the complexity of the data plane through which the flow must travel: there are several additional queues, several additional points where packets must be de-encapsulated and then encapsulated, and so on.
Increasing the stretch in this way also reduces the efficiency of the network in terms of pure utilization of resources. The efficiency of the network overall might increase through virtualization, but the amount of bandwidth required to handle one flow will increase in aggregate as the stretch of the path through the network increases. The optimization of the network therefore suffers, at least in terms of efficiency of bandwidth utilization, through virtualization.
Comparing Figure 12.1 with Figure 12.2 exposes the amount of time a packet must spend passing through the network before being filtered, illustrating the optimization tradeoff in service chaining. In most service chaining solutions, traffic is tunneled through the network, reducing the security exposure risk to some degree (tunneling is not a strong security mechanism in general, but it at least prevents internal device interfaces from being exposed to external traffic). Unfiltered traffic is still carried through the network only to be dropped at the filter, wasting bandwidth, power, and other resources.
Note
This illustration covers traffic passing into and out of the data center fabric, and, as such, illustrates the worst case of the tradeoff engineers need to look for when considering service chaining. For east/west traffic within a data center or cloud fabric, these tradeoffs may run in the other direction: service chaining might actually provide better optimization in terms of network utilization. Engineers need to be aware of both situations, and carefully consider where complexity is being added, and how optimization is impacted.
Returning to Figure 12.2, how does the DC edge router, marked Router A in the diagram, know what service chain to impose on the packets in the inbound flow? This information must be carried in the control plane in some way—a point noted in the previous section on state—but how is it communicated to the control plane to be carried to the edge router? There must be some connection between the control plane and an outside system that provides this information. What must this system do?
• Determine what the correct policies are.
• Determine which flows map to which policies; that is, which policy must be applied to each flow.
• Determine where the virtual services are in the network that can supply those services.
• Determine the correct service chain to push packets through that set of services.
• Communicate this service chain to the control plane.
Ignoring the added complexity of this external system—what might be called orchestration, as it orchestrates which flows map to which service locations—the interaction between the control plane and this system is another surface that must be accounted for. Adding a new interaction surface to allow these tasks to be completed will clearly increase the complexity of the network as a complete system.
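The tasks listed above can be sketched as a simple mapping pipeline: flows map to policies, policies map to service instances, and the result is a chain handed to the control plane. This is a hedged, illustrative Python sketch only; the policy names, flow keys, and service identifiers are all invented for the example, and a real orchestration system would be far more involved.

```python
# Illustrative orchestration sketch: all names and identifiers below
# are hypothetical, invented purely for this example.

# Policy name -> ordered list of required services.
policies = {"inspect-then-balance": ["packet-inspection", "load-balancer"]}

# (flow key) -> policy name. The flow key here is (source prefix, app).
flow_policy = {("10.0.0.0/8", "web"): "inspect-then-balance"}

# Service name -> identifier of a running instance in the fabric.
service_locations = {"packet-inspection": 1000, "load-balancer": 1001}

def build_chain(src_prefix: str, app: str) -> list:
    """Resolve a flow to the service chain the edge must impose."""
    policy = flow_policy.get((src_prefix, app))
    if policy is None:
        return []  # no chain: normal forwarding
    return [service_locations[s] for s in policies[policy]]

# The resulting chain is what must then be communicated to the
# control plane for imposition at the fabric edge.
print(build_chain("10.0.0.0/8", "web"))  # [1000, 1001]
```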
Another interaction surface to consider is the connection between packet level policy and forwarding policy. When traffic enters the DC fabric at Router A, there must be some form of classifier that recognizes that this packet belongs to a specific flow onto which a specific service chain must be imposed. This classification mechanism essentially serves as a proxy for the services through which any given packet or flow must be chained. The process of centralizing one policy has effectively dispersed another sort of policy to the edge of the network in its stead.
There are a lot of differences between these two sets of policies, of course:
• Packet inspection policy has a (potentially) deep set of filter/pass policies that must examine an array of fields. The information examined includes information about the flow that spans packets, such as the state of the flow, the type of information being transferred, and valid and invalid encodings.
• Packet classification is generally designed to operate on the minimum amount of information possible, such as a destination address or a five-tuple (the source and destination addresses, the source and destination ports, and the protocol).
• Packet inspection normally stands alone in the network; packet inspection might be combined with other services on a single appliance, but conceptually you wouldn’t mix packet inspection with load balancing, for instance.
• Packet classification is normally performed across all packets to support a wide array of services. Packets can be classified and chained at a network edge for packet inspection, load sharing, and a number of other services.
A broader way to state this is that by replacing the actual packet inspection policy at the network edge with a packet classifier that directs traffic into a service chain, you are replacing a specialized service with a more generic one. While this doesn’t eliminate policy dispersion, it allows multiple services to share the same channel and implementations along the network edge. Hence, complexity is not eliminated from the edge through service chaining, but the weight of complexity can be moved from the edge toward the centralized services.
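The difference between the generic classifier and the specialized inspection service can be illustrated with a minimal sketch: the classifier matches only header fields (here, a five-tuple) and maps the flow onto a service chain, leaving the deep inspection to the services inside the chain. All addresses and chain labels below are hypothetical.

```python
# Sketch of a generic five-tuple classifier at the network edge.
# It only matches header fields and maps the flow to a service chain;
# the deep, stateful inspection happens later, inside the chain.

from typing import NamedTuple, Optional, Tuple, List

class FiveTuple(NamedTuple):
    src: str
    dst: str
    sport: int
    dport: int
    proto: str

# Illustrative classifier table: five-tuple pattern -> service chain.
# None acts as a wildcard for that field.
rules = [
    ((None, "192.0.2.10", None, 443, "tcp"), [1000, 1001]),
]

def classify(ft: FiveTuple) -> List[int]:
    for pattern, chain in rules:
        if all(p is None or p == v for p, v in zip(pattern, ft)):
            return chain
    return []  # unmatched traffic takes the normal forwarding path

print(classify(FiveTuple("198.51.100.7", "192.0.2.10", 49152, 443, "tcp")))
```

Note how little the classifier knows: it has no view of flow state, payload, or encodings, which is exactly why it can stand in as a generic proxy for many different services.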
Using forwarding policy as a proxy for a more complex policy deeper in the network has another side effect worth considering—the interaction surface between the control and forwarding planes actually becomes deeper, not just broader. Carrying policy in the control plane means the control plane is not only responsible for determining reachability, but also for carrying policy that must be applied to packets as they pass through the network. There must be hooks or interaction points between the control and forwarding planes on each edge device to ensure that the policy information carried in the control plane is interpreted, installed, and otherwise managed correctly.
Several complexity tradeoffs don’t fit neatly into the state/speed/surface model; these will be covered in this section. Three specific areas should be of concern to the more traditional network designer:
• The size of failure domains.
• The relationship between abstraction (hiding state) and the MTTR.
• Network operation predictability.
One of the first lessons network engineers learned (the hard way) was that failure domains are an important part of the design. Two tightly coupled systems can form a single failure domain; a failure in one will “leak” into the second, causing a failure from the first to spread to the second. An example of this is in route redistribution as illustrated using the network shown in Figure 12.3.
In this illustration, Router A is sending a large number of routes to Router B through eBGP. To make the scenario realistic, assume that this is a full default-free Internet routing table, so more than half a million routes (at press time, but it’s bound to grow over time). Router B is configured as a route reflector client of Router E, so it is passing this full table on within the AS. Router B is also, however, configured to redistribute a small number of the routes learned through eBGP, from Router A, into a local OSPF process. While such redistribution between protocols is uncommon, there are real-world situations where it is useful.
If such redistribution is configured, it’s common for the redistribution to fail in some way. Either an operator misconfigures a filter (perhaps the filters are based on a community someone disables because they don’t know what it’s being used for) or the redistribution code has a defect (both, again, are real-world situations). The result of the failure is all the BGP routes being redistributed into OSPF, and the failure of OSPF to converge. OSPF’s failure then causes BGP, which relies on underlying IP connectivity to operate correctly, to fail as well, so the entire network crashes.
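The failure mode is easy to model: when the redistribution filter is removed or matches nothing, the entire BGP table leaks into OSPF. The sketch below is illustrative only, with made-up route names and counts.

```python
# Toy model of the redistribution failure described above. A
# misconfigured or disabled filter is modeled as filter_set=None,
# in which case the full BGP table leaks into OSPF.

bgp_routes = [f"route-{i}" for i in range(500_000)]
allowed = {"route-0", "route-1"}  # the small intended subset

def redistribute(routes, filter_set):
    if filter_set is None:
        return list(routes)  # filter gone: everything leaks into OSPF
    return [r for r in routes if r in filter_set]

assert len(redistribute(bgp_routes, allowed)) == 2       # intended behavior
assert len(redistribute(bgp_routes, None)) == 500_000    # failure: full leak
```

Half a million routes injected into an IGP that was sized for a few thousand is precisely the kind of event that prevents OSPF from converging, taking BGP down with it.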
OSPF and BGP are tightly coupled in this situation—not only through redistribution, but also through BGP’s reliance on OSPF to provide the underlying reachability it needs to operate. Still, this type of situation often surprises network engineers, who assume that the filters on redistribution, or the process of redistributing routes between two control planes, separate the failure domains by removing or reducing coupling.
Rather than crashing the network, it’s also possible for traffic forwarded to destinations behind Router A to be black holed at either Router C or D. Assume that a packet is forwarded to Router C by Router E with a destination somewhere beyond Router A. As Router C is not running BGP, it must rely on routes learned through OSPF to build the forwarding table entries on which this packet will be switched. If BGP is not being redistributed into OSPF at Router B, how will Router C know about the destination? It won’t—hence, it will drop the traffic. In this case, both control planes (BGP and OSPF) can be operating properly, but combined they cause a network failure. Again, the two systems, in this network configuration, must be tightly coupled for proper network operation.
In terms of surface in the complexity model used throughout this book, there are two interaction surfaces involved.
• Redistribution, which is controlled for breadth by redistributing at a small, well-defined number of points in the network. The depth of this interaction surface is controlled through redistribution filters—so although it is a deep interaction, it is limited in scope to the minimum possible.
• The interaction between reachability provided by one system and the operation of the other system is not very deep, but it is very broad. The failure of the depth control in the first point of contact between these two protocols causes a failure in the second—a classic cascading failure case across multiple interaction surfaces between two systems.
Software design has long worked around the same fundamental concepts. For instance:
Note
When services are loosely coupled, a change to one service should not require a change to another. The whole point of a microservice is being able to make a change to one service and deploy it, without needing to change any other part of the system. This is really quite important.1
1. Sam Newman, Building Microservices, First Edition (O’Reilly Media, 2015), 30.
Imposing a service chain carried by the control plane at the edge of the network creates just this sort of coupling between systems. Using service chains binds the control plane to the services architecture, and the services architecture to the data plane, or the forwarding state. This sort of coupling essentially combines the services, any orchestration service that ties services and service chains together, and the control plane into a single large failure domain. Figure 12.4 illustrates one case where a failure in one system can impact the other systems in a catastrophic way.
In Figure 12.4:
• A packet enters at Router A with a destination prefix of 2001:db8:0:1::/64.
• Router A has some policy that imposes a service chain on this packet to pass through the service labeled 1000, then 1001, and then to be released for normal forwarding.
• The packet follows the service chain to Router B, which removes the first label and then forwards it to Router C.
• The packet follows the service chain to Router C, which processes the packet, removes the service chain, and forwards along the shortest path to 2001:db8:0:1::/64.
• The shortest path from Router C to 2001:db8:0:1::/64 is through Router A.
• The packet is forwarded to Router A, which has some policy that imposes a service on this packet to pass through the service labeled 1000, then 1001, and then to be released for normal shortest path forwarding through the network.
The problem should be obvious. If the flow toward 2001:db8:0:1::/64 is large enough, it’s possible to move from a simple failure of the service chain to a cascading failure—the looping traffic can easily overwhelm one of the three links over which it is passing, causing the distributed routing protocol to fail, as well.
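The loop can be sketched with a toy forwarding model. This is not a simulation of any real control plane; the three-router topology and the hop limit are assumptions made purely for illustration, to show how re-imposing the chain at Router A keeps the packet circulating until some limit is exhausted.

```python
# Toy model of the service chain loop in Figure 12.4: Router A imposes
# the chain, Routers B and C apply the services, and the shortest path
# from C to the destination leads back through A. If A re-imposes the
# chain, the packet never escapes; a TTL-style hop limit exposes this.

def forward(dst, impose_chain_at_a, hop_limit=16):
    hops = []
    node = "A"
    while hop_limit > 0:
        hops.append(node)
        if node == "A":
            node = "B"          # chain: A -> B (service 1000)
        elif node == "B":
            node = "C"          # chain: B -> C (service 1001)
        elif node == "C":
            # chain removed; shortest path to dst is back through A
            if impose_chain_at_a:
                node = "A"      # A re-imposes the chain: loop
            else:
                return hops + [dst]  # delivered
        hop_limit -= 1
    return hops  # hop limit exhausted: the packet looped

path = forward("2001:db8:0:1::/64", impose_chain_at_a=True)
print(len(path))  # 16 -> the packet never reached the destination
```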
The effective troubleshooting of a network is heavily dependent on a few underlying processes, such as half splitting—the process of:
1. Finding the path of the signal or information through the system.
2. Finding a halfway point in this path.
3. Measuring the signal at the halfway point to determine if it is correct (as expected) or not.
4. If it is as expected, find the point halfway between the current measurement point and the end of the signal path and repeat.
5. If it is not as expected, find the point halfway between the current measurement point and the beginning of the signal path and repeat.
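The steps above amount to a binary search over the measurement points along the signal path, which can be sketched as follows. The path names and fault location are hypothetical.

```python
# Half splitting as a binary search over measurement points.

def half_split(path, measure):
    """path: ordered measurement points; measure(p) -> True if the
    signal is still correct at point p. Returns the first point at
    which the signal is wrong."""
    lo, hi = 0, len(path) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if measure(path[mid]):
            lo = mid + 1   # correct here: the fault is further along
        else:
            hi = mid       # wrong here: the fault is here or earlier
    return path[lo]

# Hypothetical path where the signal goes bad at the fourth hop.
path = ["edge", "agg1", "core", "agg2", "tor", "server"]
print(half_split(path, lambda p: path.index(p) < 3))  # agg2
```

The sketch also shows why an engineer must be able to enumerate the path in the first place: if the flow of information through the system cannot be reconstructed, there is nothing to search over.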
The ability to troubleshoot a network (or any system) depends on the ability of an engineer to understand the path of the signal, or the flow of information, through the system. Edsger Dijkstra described this situation in more eloquent terms in his paper on goto statements in programming languages:
Note
My first remark is that, although the programmer’s activity ends when he has constructed a correct program, the process taking place under control of his program is the true subject matter of his activity, for it is this process that has to accomplish the desired effect; it is this process that in its dynamic behavior has to satisfy the desired specifications. Yet, once the program has been made, the “making” of the corresponding process is delegated to the machine. The unbridled use of the go to statement has an immediate consequence that it becomes terribly hard to find a meaningful set of coordinates in which to describe the process progress.2
2. Edsger Dijkstra, “Go To Statement Considered Harmful,” University of Arizona, n.p., last modified 1968, accessed June 5, 2015, http://www.u.arizona.edu/~rubinson/copyright_violations/Go_To_Considered_Harmful.html.
Dijkstra, the inventor of the SPF algorithm most widely used in link state protocols, is essentially saying here that goto statements make it much harder to connect the flow of the information with the flow of the code. Instead of simply reading the code, someone trying to troubleshoot and fix a problem must “break context” by following goto statements wherever they lead, trying to put the entire flow back together in their heads. Turning to Dijkstra’s argument again:
Note
My second remark is that our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed. For that reason we should do (as wise programmers aware of our limitations) our utmost to shorten the conceptual gap between the static program and the dynamic process, to make the correspondence between the program (spread out in text space) and the process (spread out in time) as trivial as possible.3
3. Ibid.
Transferring this to the networking world, and service chaining in particular, is simple: a service chain is the software equivalent of a goto statement in code.
Coders now use goto widely—there are over 100,000 goto statements in the Linux kernel code4—but they do so within patterns (or models) that make it easy to understand what the goto is being used for and how to trace information flowing through the process. Using multiple goto statements in place of an if/else construction is considered bad form, but using goto to break out of a loop when an error has been encountered is considered okay. A goto leading to another goto is considered particularly bad form, as this often leads to spaghetti code.
4. “Goto,” Wikipedia, accessed June 5, 2015, https://en.wikipedia.org/wiki/Goto#Criticism_and_decline.
In the networking world, creating a spaghetti flow with service chaining can be just as easy (and just as bad) as creating spaghetti code using goto statements in a program. The result is a disconnect between the flow of the information through the system and the service chain itself, causing major headaches in troubleshooting, high MTTR, and high Mean Time Between Maintenance.
As a practical illustration, consider the network engineer who is trying to resolve a problem with jitter through a particular network. The engineer begins at the network edge, examining the table entries that will be used to forward packets in flows of a specific type through the network. Examining the forwarding table on the inbound device, the engineer finds a set of flow labels.
Now what? What precisely does this set of flow labels mean? Which network segment will this packet flow over? That information is abstracted out; the engineer must connect the label with a forwarding path by stepping deeper into the forwarding table. The path itself is disconnected from the topology, as well.
Each of these disconnects reduces the ability of the engineer to quickly grasp and troubleshoot the problem at hand—making troubleshooting network level problems a complex problem, indeed.
One often unstated network design principle is predictability. Two specific considerations drive the desire for predictability in a network design:
• It’s difficult to measure things that are constantly changing in any way that actually provides a meaningful baseline. If the state of the network is constantly changing, including the path of particular flows or application traffic, it’s difficult to get a feel for what “normal” looks like. This inability to determine what “normal” looks like directly impacts the ability of the network engineer to predict future requirements, and to repair problems once they appear. It’s hard to know what is broken if you don’t know what “working” looks like.
• It’s hard to know what the state of the network will be after any specific failure occurs. If a link fails, where will the traffic move, and what will the impact be on the links it moves to?
Network engineers can get into something of a panic if they can’t answer these kinds of questions—they’re fundamental to the ability to predict how much additional bandwidth is needed, to determine where quality of service needs to be applied, to control delay and jitter, and many other things.
It makes more sense to try and get a handle on a bigger picture—to mine the network for information, rather than to try and make per link predictions. Not everything can be defined in the design or deployment phase of network engineering; sometimes it must be discovered, instead.
Rising above the wire, however, service chaining and function virtualization can improve the ease with which you can understand and manage the flow of information through the network in two ways.
First, service chaining abstracts out the topology—much like a link state flooding domain boundary—leaving just information about what services a particular flow must pass through, and how those services modify the flow. This abstraction relieves the orchestration system, which is spinning up services and matching flows to services as needed, from dealing with more fine grained details, such as how traffic will actually flow through the network. In the same way, this abstraction allows those who are more business minded to focus on the flow of information without getting involved in how the network actually works. This can improve the connection between business drivers and network engineering (in design, implementation, and operations), providing a pathway between the two worlds.
This effectively separates topology from policy, layering the control plane. Each resulting layer deals with one type of information well, and interacts with the other layer through a well-defined and constrained surface. This follows the cohesion principle: gathering related behavior together, and pushing unrelated behavior into other components. This also helps implement the single responsibility principle: focusing each particular service or component on doing one good job well.
Second, virtualizing functionality allows services to be broken into smaller pieces that can be scaled out, rather than maintained as larger monolithic units that must be scaled up. Each service then becomes a fairly standard “thing” that can be managed separately, including being placed where there are resources on the network rather than where the traffic is already flowing. Virtualization, combined with service chaining, allows services to be managed as low-touch services, rather than as high-touch appliances, or unique, custom engineered solutions (also known as “special snowflakes”).
It’s important to go back to the beginning of this book and consider the role complexity plays in the network. It’s impossible to resolve complexity away; rather, complex solutions are required to solve hard problems. The most that network engineers, designers, architects, and all the other people working in and around networking technologies can do is manage complexity. Don’t be afraid of complexity, but don’t allow complexity to build where it doesn’t add any value. In the spirit of this line of thinking, how can network engineers manage complexity when service virtualization and chaining come into play?
First, return to the basics. Remember the hourglass, illustrated in Figure 12.6?
Strive to build an hourglass in the network, as this facilitates loose coupling across domains and allows you to separate complexity from complexity. While you might not have multiple physical layer protocols, you might face the temptation to deploy a dozen different control planes, and a matching number of transport overlay mechanisms, onto a single fabric. After all, what does it hurt if you run VXLAN, NVGRE, MPLS, and MPLSoGRE on the same IP infrastructure? What does it matter if there are multiple SDN controllers, BGP, and some other mechanisms all tangled up as overlapping control planes?
The difference it makes can be seen by returning to the model of complexity built earlier around state, speed, surface, and optimization. Each additional control plane adds another layer of state; each additional transport system adds another interaction surface. A rule of thumb here might be: since you can’t control complexity in some areas of the network, for instance the way applications run across the network and the path traffic takes through the network, keep a tight lid on the complexity you can control. This will encourage the hourglass model, ultimately making the complexity manageable.
Second, remember the lesson of goto. Once you start using goto, it can be addictive. “Why not just place this part of the service over here in this data center, and that part over there in another one? Why can’t we service chain between various databases on the backend, and between services on the front end, and between business logic in the middle?”
The answer is because you’re going to end up with spaghetti code. Just as spaghetti code is unmaintainable within an application, it’s also unmaintainable between the services that result from breaking an application up. Build a framework around what is, and isn’t, allowed with the goto statement of service chaining. One suggested rule, for instance, might be “no traffic, once it has passed through a service, should pass through the same service again.” This is the equivalent of not allowing gotos to be chained through a number of modules, which would make it impossible to mentally connect the traffic flow with the logic (or service) flow.
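The suggested rule lends itself to an automated check. A minimal sketch (the service names are hypothetical) that flags any chain in which traffic would traverse the same service twice:

```python
def violates_no_revisit(chain):
    """Return the first service a flow would traverse twice, or None.
    Implements the suggested rule: traffic, once it has passed through
    a service, should never pass through that service again."""
    seen = set()
    for service in chain:
        if service in seen:
            return service
        seen.add(service)
    return None

# Hypothetical service chains for illustration:
assert violates_no_revisit(["firewall", "nat", "load-balancer"]) is None
assert violates_no_revisit(["firewall", "ids", "firewall"]) == "firewall"
```

A check like this could run as part of chain validation before a new service path is deployed, the same way a linter flags tangled gotos before code ships.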
Given the complexity of building and managing virtualized services and service chaining, why are so many large network operators moving in this direction? The primary reason is actually that the business logic and development folks are (for once) throwing complexity back over the cubicle wall. Whether the complexity belongs in the network, or in the application and business logic, is ultimately a business decision. Often, though, a few factors dictate that pushing this complexity into the network makes sense. For instance:
• Making the application development process match the flow and pace of business operations allows IT to drive more value. This, in turn, makes IT a more integrated part of the business—a good thing.
• Breaking monolithic applications into individual services allows the business to compose applications on the fly. This means new services can be developed and offered without building them entirely from scratch; reuse of existing services is easier than reuse of existing monolithic applications.
• Breaking monolithic applications into individual services allows the development teams to build and manage applications in a way that prevents a number of traditional problems—such as the huge, large-scale application that runs on a server sitting in the corner that everyone is afraid to touch, but is crucial to the business’ operation.
Given the cost–benefit tradeoffs, service virtualization is usually a good tradeoff rather than a bad one. This is why the concept is becoming so popular, particularly in large-scale service provider networks.
Cloud computing is one of the big movements in the networking and information technology worlds right now. Almost everyone wants to either build a cloud or use one. In response, the industry has built a set of definitions, practices, and concepts around cloud computing that almost everyone knows—and almost no one really understands.
The promise of cloud, of course, is that by putting your stuff in the cloud you’re outsourcing all the infrastructure and management to someone else—someone who, in theory, can manage the infrastructure cheaper than you can, and can provide you with on-demand services as you need them. Moving to the cloud should save you lots of money and time. At least that’s the theory.
This chapter will consider complexity and the cloud. Does moving to the cloud really solve all the complexity problems you face on the networking, compute, and storage infrastructure front? Or does it just move the complexity someplace else? Does cloud provide “the” silver bullet for complexity, by throwing all the complexity into someone else’s lap, or is it really just another set of options and tradeoffs engineers need to consider carefully?
Given everything discussed about complexity in this book, the answer should be one you know well already—cloud isn’t the ultimate solution to all your network complexity problems. You can move complexity around, but you can’t get rid of it or solve it.
This chapter will approach cloud from three perspectives. The first is a model around complexity and the cloud, the second is an alternative model useful for breaking down the types of services being supported. The final section will discuss some more specific complications and complexities when considering cloud as a solution.
Given you’ve read to this point, you already know that you can’t eliminate complexity. In fact, you can’t really toss it over the cubicle wall very effectively all the time; sometimes the result of splitting the problem in the wrong place, and trying to offload the wrong parts, is more complexity than what you started with. To avoid these results, it’s often useful to model the problem space in some way, so you can think through the complexity tradeoffs in terms of state, speed, and surface. Figure 13.1 illustrates one model you can use to break the tradeoffs down into manageable pieces.
This figure illustrates three distinct types of deployment for any particular application or system. There are, in the real world, any number of varying degrees between these three (for instance, hybrid cloud), but focusing on these three points along the continuum is useful for understanding some of the basic issues in play.
In every case, you must manage the business logic, application development, making decisions about what to store where, process management, and anything else related to the actual building and running of an application to meet a specific set of business needs. As you move from right to left, however, different pieces of complexity trade off against one another.
Starting at the far left, a cloud centric deployment moves (or creates) all processing and data storage into a cloud provider’s service. This can be very attractive, as it allows the business to focus on the actual information processing aspects of the problems at hand, removing entirely any interaction with infrastructure hardware, architecture, network or compute and storage design, infrastructure tools and software, control plane protocols, and all the other pieces that relate to running a network. These pieces can be a huge administrative load, particularly for businesses whose main focus is on a product or service, rather than on technology.
On the other side of the tradeoff, though, are a bundle of added business and technical complexities. On the business side is the reliance on a provider for your business, including responsiveness and trust. It’s important to note that most providers have their own business plans (big surprise, right?), and their business plan and yours don’t always completely overlap. In other words, the provider cares about the state of your business insofar as it grows theirs—but this doesn’t always work out the way you think it might.
The vendor centric model is the one most of the networking world tends to live under. For any given project, a set of requirements is laid out, and then a set of vendors is examined to see what products they might have, or are planning to have, that will meet those requirements. Engineers don’t tend to think of this type of model as a form of outsourcing—but it is. What’s being outsourced is some basic (or high-level) design work, management of the network software process (including protocols), and the designing and building of the hardware components.
There are many positive attributes to this type of outsourcing. It places the company in the position of having a vendor partner who is doing much of the fundamental research needed to keep up with technical trends, and “one number to call” when things go wrong. This solution can be much like using plastic building blocks to build a castle—the options are limited, yes, but they are often limited to things someone with more expertise and experience knows should work. This can take a lot of engineering load off the company while allowing semicustomized solutions that can keep the business ahead of the game, while retaining ownership of the data within the business itself.
The business still needs to interface with the business logic, applications, and information handling pieces of the puzzle, but doesn’t need to think as hard about network design and architecture, equipment life cycles, and other related matters. The other side of this tradeoff is the business must now manage a set of vendor relationships, including finding people who are qualified in that specific vendor’s equipment and vendor lock-in through training, legacy gear, mindset, and many other potential factors. While this is the choice most companies make, it can actually be the worst of both worlds—vendor relationships can be as difficult to navigate as provider relationships, design constraints can stunt business growth by limiting opportunities for strategic advantage, and the business ends up being bound to the vendor’s growth plan through upgrades and new product rollouts.
The network centric vision might best be described as aligning with the white box trend in networking technology. In this area, the infrastructure hardware and software might be outsourced to different companies or vendors, untying the two, and the IT staff takes on more of the integration work to put all the pieces together into a single unified whole.
Moving down this path requires a large, competent IT staff, so it’s not for the faint of heart in terms of raw complexity. The engineers involved in these sorts of projects must be technology focused, rather than vendor or offering focused, and know how to integrate and manage the different pieces to make a unified whole. In the real world, of course, more complex vendor focused deployments and network centric deployments may be very similar in complexity levels from an engineering perspective, as virtually no vendor-based solution actually covers all the requirements for any given project or business plan.
The tradeoff in complexity from an engineering perspective is untying the business from a vendor’s upgrade path, being able to right size the network to the job at hand, and being able to quickly take advantage of new ideas in design and architecture to reap competitive advantage.
It’s tempting to choose one path among these three and declare it “the right way.” In reality, however, many different ways might be best for any particular application or business. For instance, outsourcing sales tracking might be a good idea, while at the same time building custom software on a network centric data center for manufacturing processes unique to the company. The point here is to find a set of models that help the architect think through the complexity and business tradeoffs, rather than providing a pat answer.
While the first model of cloud services can be helpful in understanding the tradeoffs in using a cloud service, the second model covered here can be more effective in understanding the range and scope of cloud offerings. There are, in any specific application or technology system, only a few real moving parts:
• Storage—where the information is stored when it’s not being actively used for something
• Infrastructure, or plumbing—the parts of the system that move information around between the other system components
• Compute—the computing power used to actually do the processing required
• Software Environment—the set of software and services on which the application is developed and deployed, including an operating system, database system, and other components
• Intelligence/Analytics—the logic used to process the information, and the code that implements this logic
While each of these components can be very complicated in themselves, these are the only five real components of any information technology system. Everything else either builds, manages, or somehow interacts with one of these five things. Cloud services, then, can be broken down by what they centralize, virtualize, or outsource, as outlined in the list that follows:
• Storage as a Service: A number of services effectively offer centralized storage and synchronization with minimal processing and/or applications APIs to lay on top of the service. This correlates to simply outsourcing storage. Some services in this space, at the time of this writing, include DropBox, Google Drive, SpiderOak, Azure Storage, and Amazon S3.
• Infrastructure as a Service (IaaS): A number of cloud services provide storage, network connectivity, processing power, and other tools within a hypervisor framework. Users must install their own operating system within the virtual machines spun within the service, build any connectivity, and build other services as needed. The service will always include some form of storage system—so IaaS always includes storage as a service, even if storage isn’t the primary focus. This is a form of centralizing storage, infrastructure, and processing, but without the intelligence and/or analytics.
• Platform as a Service (PaaS): Implements storage, infrastructure, and processing services, like IaaS (above), but includes the software environment as well. Generally the software environment takes the form of operating systems installed on the virtual machines, some form of database backend, and other services.
• Software as a Service (SaaS): Implements the centralization of all the components of an application, including storage, infrastructure, processing, a software environment, and intelligence, up to and including business logic. SaaS can actually be implemented under many rubrics, including Software Defined Wide Area Network.
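One way to make this breakdown concrete is as a simple lookup table. The sketch below is an illustration of the five-part model described above, not an official taxonomy; the groupings are an interpretation, and the helper answers “what does the business still own?” for each service class:

```python
# The five components of any information technology system, per the model.
COMPONENTS = ["storage", "infrastructure", "compute",
              "software environment", "intelligence"]

# Which components each service class centralizes/outsources (illustrative).
OUTSOURCED = {
    "Storage as a Service": {"storage"},
    "IaaS": {"storage", "infrastructure", "compute"},
    "PaaS": {"storage", "infrastructure", "compute", "software environment"},
    "SaaS": {"storage", "infrastructure", "compute",
             "software environment", "intelligence"},
}

def retained_by_business(model):
    """Components the business still owns after choosing a given model."""
    return [c for c in COMPONENTS if c not in OUTSOURCED[model]]

print(retained_by_business("PaaS"))  # ['intelligence']
```

Asking the question this way makes the SaaS row stand out: nothing is retained, so the business is trusting the provider with everything up to and including the business logic.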
Asking “what is being centralized,” or even better, “what is being outsourced,” allows you to clearly think about what interface between the business and the technology is being deployed to solve the business problem.
As an example, consider Desktop Virtualization (DV, or thin client). This is an interesting case because it clearly centralizes and outsources storage, infrastructure, processing, and the software environment. But does it centralize intelligence and/or analytics as well? This would depend on what the DV service offers in the way of applications that are preinstalled on the virtual desktops. Generally speaking, however, DV is only going to include minimal services that might be considered at the intelligence level in the model. This means the primary security concerns are going to be around the storage of files being worked on in the DV environment, rather than the actual implementation of business logic in applications deployed on a PaaS cloud, for instance.
With these models in the background, it’s a bit easier to consider the actual complexity tradeoffs involved in moving services into the cloud, and to determine which type of service might fit better in a public, private, or hybrid cloud—or whether some services should just be deployed in a more traditional manner. This section explores several specific areas of complexity that need to be managed when pushing applications or data to a cloud service.
Security is, by far, the most commonly cited reason for businesses not to place their data in the cloud. This is well covered territory, but it’s worthwhile to consider some of the forms of attack possible against a cloud-based service, the countermeasures, and the attendant complexity issues.
Centralized data is a bigger target. In the early 1960s, a gang took the money bags from a Royal Mail train travelling between Glasgow and London. They raided the train on a tip that it would be carrying the excess cash from dozens of banks between the two cities, being carried into London for storage over a bank holiday weekend. The take was the largest in British history to that time, and most of the money was never recovered. The lesson is that while the excess holdings of any specific bank weren’t large enough to merit such a carefully orchestrated plan, the total of their holdings, gathered in a single place, was well worth spending the time and effort to take.
In the same way, the total holdings of a service provider’s cloud offerings may offer enough rewards to attract well designed plans to “take the entire haul.” Each business user of such a service might believe their information is not worth attacking, because the value of the data would not cover the costs of taking it—but the aggregate of information might present a far different picture.
Note
At press time, several very large breaches in cloud-based systems have been in the news. Some of these breaches have resulted in the loss of millions of records containing personal information that could well be used for nefarious purposes, such as identity theft or even extortion. These types of breaches should make businesses think seriously about the security of the information they’re storing in cloud-based environments, and to be very serious about security audits and breach procedures. You can outsource your data, but you can’t outsource the responsibility of cleaning up the mess if your data is breached.
The complexity added here is two-fold. The users of the cloud service must audit the cloud provider to ensure security, and track all breaches into the provider’s service, rather than just the ones impacting the data the customer cares about. Any breach could be just the visible top of a much larger breach. The users of the cloud service may also need to take more steps to protect their data given it is being stored off site, or rather outside the physical bounds of their network’s operational space.
The counter to this concern is that professional cloud providers can often afford to build security systems that are far beyond the ability of just about any of their customers—so while centralized data is a bigger target, it is also easier to defend.
The tradeoffs might very well break even, but the user of a cloud service needs to be aware of the additional complexity involved in managing security from a “one off” position—without direct access to the infrastructure, policies, or control mechanisms that protect the information stored in the service directly.
Cross customer contamination. A cloud environment represents a set of virtual machines, interconnections (networks), database tables, and storage spaces located on a set of physical resources. Any time there are multiple processes running on a single processor, some form of cross contamination is always possible—a type of security threat where one process’ data can be read or modified by another process running on the same hardware. Figure 13.2 illustrates possible cross contamination points in a virtualized system.
This is often blocked by the virtualization software, but users on a shared system still need to check for such corruptions and data breaches, if possible, adding complexity to the applications deployed onto these shared infrastructures.
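One hedged sketch of such a check, using only the standard library: a tenant computes a keyed digest (HMAC) over data before it is written to shared infrastructure, then verifies the digest on retrieval. This does not prevent contamination, but it detects modification regardless of its cause, with a key the platform never sees:

```python
import hashlib
import hmac

def tag(data: bytes, key: bytes) -> str:
    """Compute an HMAC over data before handing it to shared infrastructure."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()

def unchanged(data: bytes, key: bytes, expected: str) -> bool:
    """On retrieval, detect modification (e.g., cross-tenant contamination)."""
    return hmac.compare_digest(tag(data, key), expected)

key = b"tenant-held key, never shared with the platform"
record = b"customer balance: 100"
stored_tag = tag(record, key)  # kept by the tenant, outside the shared system

assert unchanged(record, key, stored_tag)
assert not unchanged(b"customer balance: 900", key, stored_tag)
```

The tradeoff is exactly the added complexity the text describes: every application touching the shared store now carries tagging and verification logic, plus a key-management problem of its own.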
Physical Media Management. In 2005, a major (Category 5) hurricane came up the Gulf Coast of the United States, flooding New Orleans and necessitating the evacuation of the city. During the disaster itself, and in the cleanup following, many buildings were destroyed, or gutted and essentially rebuilt. This brings up a perhaps odd, but still important, question—what happens to the hard drives that hold all your corporate data in such a situation? If a tornado strikes the data center in which a cloud service you rely on resides, scattering hardware, will your data still be secure?
Users of cloud services need to verify the status of their data in the case of physical disasters, the disposal of hard drives and other equipment, and even the loss of single hard drives to rogue employees or inside threats. Often there is little the cloud user can do other than simply trust the provider’s processes and controls, but some types of information might need to be encrypted, or require special handling for applications deployed in a cloud environment.
Who controls the crypto? Finally, most cloud providers encrypt data at rest (on a hard drive or other storage device) so that users cannot see one another’s data. However, if the provider is providing the encryption, then the provider must have the key used to encrypt the data. This means at least one person working for the provider has access to the keys, and hence can read the data itself.
The user of the cloud service has at least three options:
• The first is to encrypt any information stored in the cloud service within the processes using the data, so it’s secured by some encryption coded into the applications running in the cloud environment as well as the provider’s encryption. This option can add a good bit of overhead to the development and processing requirements for any application within the cloud environment, and it’s not always a practical solution.
• The second is to trust (and potentially audit) the controls the cloud provider uses to manage access to the information stored in the cloud environment.
• The third is to regularly audit the security systems in place, actively checking for security breaches—or perhaps to hire a third party auditor to take on this role.
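The key-custody point can be illustrated with a deliberately toy cipher. This is NOT a real encryption scheme and must never be used for actual data; it exists only to show that whoever holds the key, the tenant or the provider, can recover the plaintext:

```python
import hashlib
from itertools import count

def keystream(key: bytes):
    """Toy keystream from a hash chain -- for illustration ONLY, not secure."""
    for i in count():
        yield from hashlib.sha256(key + i.to_bytes(8, "big")).digest()

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR data against the keystream; applying it twice decrypts."""
    return bytes(b ^ k for b, k in zip(data, keystream(key)))

# If the PROVIDER generates and stores this key, the provider's staff
# can decrypt tenant data at will -- that is the whole concern.
provider_key = b"generated and stored by the provider"
ciphertext = toy_encrypt(provider_key, b"tenant data")

assert toy_encrypt(provider_key, ciphertext) == b"tenant data"
```

The first option in the list above amounts to running a second layer like this with a key only the tenant holds, so that the provider’s copy of the key no longer suffices to read the data.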
A second area of concern in the complexity space for information and applications stored in a cloud environment is data portability. Particularly in the case of centralized intelligence (largely SaaS deployments), the applications developed and maintained by the cloud provider may be stored in a very specific format. If the business decides to switch providers or applications, retrieving any information from the service can range from easy to impossible. In PaaS, this may be less of a concern, as the business is actually developing applications to run on the cloud provider’s service, and hence controls the formatting and storage of information.
While this might seem like a rather obvious, or even mundane, problem, the cost of moving data out of the cloud environment is rarely considered in any analysis of the advantages and disadvantages of moving applications or storage into a cloud environment. Moving data between formats can add a great deal of complexity in the deployment of a new process or application. This is something businesses considering outsourcing to cloud-based services need to specifically ask about at the beginning of the process—rather than waiting until the end.
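As a minimal sketch of planning for portability (the record fields are hypothetical), keeping a copy of the data in open, re-shapeable formats is usually straightforward with standard tooling, and asking whether this is possible up front is cheaper than discovering it isn’t at migration time:

```python
import csv
import io
import json

# Hypothetical records, as a provider's export API might deliver them.
records = [
    {"id": 1, "customer": "acme", "balance": 100},
    {"id": 2, "customer": "globex", "balance": 250},
]

def to_csv(rows):
    """Re-shape JSON-style records into portable CSV."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

portable = to_csv(records)
# Round-trip check: the exported data can be read back without the provider.
round_trip = list(csv.DictReader(io.StringIO(portable)))
assert round_trip[0]["customer"] == "acme"

json_copy = json.dumps(records)  # keep a JSON copy alongside the CSV
```

The hard cases are the ones this sketch glosses over: proprietary formats with no export API at all, which is exactly what the up-front questions should uncover.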
This chapter has strayed from the state/speed/surface model of complexity used in the rest of this book—but complexity, for the network engineer or architect, doesn’t just come from protocols or topologies. The complexity tradeoff in cloud computing can be seen as falling into one of two broad categories—the complexity of the business process versus the infrastructure complexity, and the complexity of managing a relationship versus the complexity of managing a network infrastructure. There is a continuum from totally outsourced/centralized to totally insourced/managed the architect can follow, and a variety of useful models for organizing the decisions and tradeoffs involved.
Even in places where you might think you’re safe to “offload” complexity by “throwing it over the cubicle wall,” you need to look hard for the complexity tradeoffs, because there is always a complexity tradeoff involved someplace.
This book has covered a lot of heavy, theoretical material. It began with a general overview of complexity, and then moved through a wide array of networking technologies to illustrate the complexity found in each. What do you do with this information? There are, of course, direct applications within these pages; this chapter attempts to bring all the lessons learned in the examination of each of these areas into focus.
The chapter begins with a short review of complexity as a theory, and then reviews the three-part model used here to describe complexity. The final section develops a number of practical steps you can take on a day-to-day basis to manage complexity. The goal is not to provide the answers, but rather to provide the questions and framework that will help you find the answers for any given situation. The end goal is a set of tools, and a new way of looking at old tools, that will help you manage complexity in every area of network design and operation.
Complexity is difficult to define, at least partly, because different engineers find different things complex—and some researchers even argue that complexity is just an “observer problem.” Rather than providing a precise definition, it’s better to consider two specific components that always appear to be related to perceived complexity. It might not be possible to define complexity down to a single statement, but it’s possible to describe complexity in a way that makes it recognizable.
Anything that is difficult to understand is often perceived as being complex, whether or not it really is. What often makes complexity difficult to define is that things that appear complex to one person are seen as simple by another—or things appear complex to an observer who is missing key facts, pieces of information, or training. Any definition of complexity needs to reach beyond this observational problem, however, and find some way to say, “this is complex,” regardless of whether any particular observer believes that it is complex. In fact, one of the most difficult problems to solve in the world of design is the belief that complex things are simple, leading to an insufficient level of attention being paid to the complexity that is really there.
Rather than arguing that the appearance of complexity should define complexity, it is better to find some more objective measure—and perhaps that objective measure can be found in the idea that complexity involves attempting to solve for sets of items that are actually mutually exclusive. Things that are complex, then, involve solutions to paradoxes, or the attempt to come as close as possible to resolving a set of problems for which there is no ultimate solution. The closer the solution tries to come to solving an ultimately unsolvable problem set, the more complex the solution must become.
This second description of complexity is actually related to the first. When faced with a problem that cannot be solved, the natural tendency of an engineer is to go ahead and try to solve it anyway. The result of this exercise is an increasing array of unintended consequences—latent problems that don’t appear in the original design, but show up later in unanticipated states or situations. Of course, unintended consequences might also just be a symptom of poor system design, or of a system design that has been pushed far outside its original operating envelope. But even in well-designed systems, there will be unintended consequences.
Finally, complex systems tend to have a large number of interacting parts. Just how these parts interact and the relationship between those interactions and complexity are details left to a section just ahead in this chapter.
While it is certainly true that complexity is a necessary result of any response to a hard problem, it is always possible to have a poorly implemented solution that is too complex—what might be called inelegant, unsustainable, or even “it won’t scale.” While it’s easy enough to recognize such situations in the extreme cases, or in cartoon depictions of strange machines that provide simple services through long and unrepeatable chains of events, what measure can be used to find such systems in the real world? Several measures suggest themselves. A system that solves the problem is more than likely too complex if:
• The cost of maintaining the system overwhelms the cost of simply dealing with the problem in the first place.
• The complexity of the resulting system is worse than the complexity of simply working around the problem.
• The mean time to failure and/or the mean time to repair are so great that the users of the system end up having no way to address the problem that’s supposed to be solved for long stretches of time (the availability is too low to allow the system to meet business needs).
• The insertion of repairs or changes causes a large number of unintended consequences, each of which is difficult to understand and trace to a root cause, and many or some of which are repaired without any understanding of how the implemented fix actually solved the problem.
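The availability point in the list above follows from the standard steady-state formula: availability = MTBF / (MTBF + MTTR). A minimal sketch (the failure and repair numbers are illustrative, not from the text):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is usable."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system that fails every 2,000 hours and takes 4 hours to repair
# is available about 99.8% of the time.
fast_repair = availability(2000, 4)

# The same failure rate with a 40-hour repair time drops availability
# below 99%, which may be too low to meet business needs.
slow_repair = availability(2000, 40)
```

Note that the same availability target can be met by making failures rarer or repairs faster; the tradeoff discussion in this chapter applies to both levers.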
In the first chapter, “Defining Complexity,” the chart in Figure 14.1 was used to illustrate the relationship between robustness and complexity. As the complexity of the problem increases, the complexity of the solution must also increase. At some point, increasing complexity in the solution stops increasing robustness. Instead, further increases in the complexity of the solution actually begin to decrease the system’s robustness. This can be considered the point where the solution becomes too complex.
It’s important to remember that systems aren’t necessarily created too complex, but rather move from some simpler state to a more complex one over time. It’s useful to think of the process of increasing complexity as ossification; much like bones or other organic materials are hardened into stone over time, a flexible and robust design can be hardened through the replacement and addition of different pieces over time. These replacements are seen as hardening the system—and, in fact, they do harden the system. However, as the system is hardened, it also becomes brittle. The result is a system that is hard to break, but once it’s broken, it shatters, rather than degrading gracefully.
Why not just “solve” complexity? Why not find some way to build a system with a perfect fit to the problem that simply isn’t complex? The simple answer to this is the world just isn’t built that way. While there may be some philosophical or “deep math” reason why this simply isn’t possible, the why doesn’t matter—what does matter is that complexity is a tradeoff. For any given set of three interrelated pieces of a problem, you can choose two. Figure 14.2 illustrates this concept.
There are a number of examples of this three-sided complexity problem, including:
• The CAP Theorem. There are three desirable traits for any given database: consistency, accessibility, and partitionability. Consistency refers to the results of any given operation by any given pair of users at any given point in time. If a database system is consistent, any two (or more) users accessing the same database at the same time will retrieve the same information—or rather, every user of the database has the same view of the data, no matter when it is checked. Accessibility means just what it sounds like, the ability of any user to access the information in the database at any time. Partitionability refers to the ability of the database to be partitioned, or split-up, onto multiple hosting systems. It is impossible for a database to have all three of these attributes to the maximum possible level at the same time.
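A toy model can make this tradeoff concrete. In this sketch (the class names and two-replica design are mine, not any real database), a consistency-first store refuses service during a partition, while an availability-first store keeps answering but may return stale data:

```python
class Replica:
    def __init__(self, value=0):
        self.value = value

class ConsistencyFirstStore:
    """Refuses reads and writes during a partition rather than diverge."""
    def __init__(self):
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value):
        if self.partitioned:
            raise RuntimeError("unavailable: cannot replicate during partition")
        self.a.value = self.b.value = value

    def read(self):
        if self.partitioned:
            raise RuntimeError("unavailable during partition")
        return self.a.value

class AvailabilityFirstStore:
    """Keeps answering during a partition, at the cost of stale reads."""
    def __init__(self):
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value):
        self.a.value = value       # always reaches one replica...
        if not self.partitioned:
            self.b.value = value   # ...the other only while reachable

    def read_from_b(self):
        return self.b.value        # may be stale during a partition
```

Once the stores are partitioned, the first trades accessibility for consistency and the second trades consistency for accessibility; neither can keep both.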
• Quick, Cheap, High Quality. It’s a well-worn maxim in just about every area that you can choose to build something that’s cheap and of high quality, but it will take a long time; you can build something quickly that’s of high quality, but it will cost a lot; or you can build something quickly that’s cheap, but quality is going to suffer. This applies for everything from software to nuts and bolts.
• Fast Convergence, Stability, Simple Control Plane. In network architecture, decreasing the convergence time will always either cause the complexity of the control plane to increase, or it will cause the network to lose stability so it is more likely to fail.
There is probably a virtually infinite number of these “three sets” in every area of engineering (and life at large). That they exist in many different areas and realms should cause us to be alert to their existence and involvement in every engineering problem we work on—and not only the obvious ones, but also the “deeper” ones. For instance, the last example in the above list is not one many network engineers tend to think about when they’re tuning a network for high-speed convergence, but it’s just as real as the first two in the list [in fact, it’s a result of, or corollary to, the first point, the consistency, accessibility, and partitionability (CAP) theorem].
Complexity, then, will always be a tradeoff in the real world. The key point to remember is to assume that anything you’re working on will involve a complexity tradeoff.
This book has used a single model to explain, examine, and comprehend complexity: state, speed, and surface. Each of these three is interrelated to the other in every system you work on, so a solid understanding of these three aspects of complexity can be very helpful in managing complexity in the real world.
• State: The amount of information carried, held, or otherwise managed within a single component or system. Within this book, state has primarily referred to the amount of information carried in the control plane, but it could relate to just about any type of information, including things like the state of the window in TCP, or the neighbor list used in a routing protocol to ensure two-way connectivity before bringing up a peering session.
• Speed: The rate at which the state being carried in the control plane changes. Throughout this book this has been related to the rate at which the topology changes, driving changes in the reachability information carried by the control plane. There are many other examples, however, such as the rate at which the data being carried through a network changes, or the rate at which security credentials time out and must be replaced.
• Surface: The surface can be described as the depth and width of the interface between any two subsystems of a system (or any two interacting systems). The more two systems share state, or one system relies on a deep understanding of the state contained in another system, the deeper the interaction between the two systems. The more points at which two systems touch, or interact, the broader the interaction surface between the two systems. Broader and deeper interaction surfaces cause the state in one system to be more closely (tightly) tied to the state in another system, thus increasing the likelihood of unintended consequences in one system based on a state change in another. As the interaction surface increases in depth and breadth, the more difficult it becomes to understand and account for every possible consequence of any given state change. This is simply a factorial or array multiplication problem. Increasing the elements increases the number of possible combinations, and hence the number of possible final states. At some point, it’s impossible to know or test for every combination. Further, if each system or subsystem uses some form of abstraction to hide its internal actions in relation to a specific piece of information, the problem becomes almost infinitely complex.
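The factorial/array-multiplication point is easy to quantify. A small sketch of how quickly the state space and the number of pairwise interaction surfaces grow:

```python
from math import comb

def reachable_states(components: int, states_per_component: int) -> int:
    """Upper bound on system states if components vary independently."""
    return states_per_component ** components

def interaction_pairs(components: int) -> int:
    """Number of possible two-way interaction surfaces between components."""
    return comb(components, 2)

# Ten components with only four states each already yield over a million
# possible system states, and 45 pairwise surfaces to reason about.
states = reachable_states(10, 4)   # 1,048,576
pairs = interaction_pairs(10)      # 45
```

Exhaustive testing of every combination stops being practical long before real networks reach ten interacting components.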
While the model of state, speed, surface, and complexity doesn’t entirely fit into the triplet model shown in Figure 14.3, it does provide a solid way of approaching the complexity problem in network design, from protocols to policy.
Engineers can optimize for speed, state, or surface. Consider network convergence speed as an example:
• Optimize for Speed: The network can be optimized for speed of convergence by increasing the state in the control plane—for instance, by calculating and carrying remote LFAs in the control plane, the network can converge almost instantly to any topology change. The calculation and carrying of remote LFAs, however, represents an increase in the state carried through the network. Another option to increase the speed of convergence might be to build a tighter link between the layer 2 (or physical) state of any given link in the network and the layer 3 (or logical) state of that same link. This, however, will necessarily increase the interaction surface between these two layers of the network, and thus increase complexity.
• Optimize for State: The network can be optimized so the control plane carries the minimal state possible. For instance, the control plane can be configured to discover and cache reachability only when it’s needed, rather than when the topology changes (in near real time). However, this will ultimately reduce the speed at which the network converges by disconnecting the control plane’s view of the topology from the real state of the topology (in CAP theorem terms, trading consistency for accessibility), thus slowing the network’s convergence down. The interaction surface between the control plane and the physical layer (the actual topology of the network) also becomes more intertwined when the control plane is optimized to minimize state.
• Optimize for Surface: It’s possible to build a network in strict layers, where no layer interacts with any other layer except in very minimal ways. For instance, a logical layer riding on top of a physical layer can be designed to completely ignore any errors reported by the physical layer, thus reducing the interaction surface between the two layers to nothing more than encapsulation (and perhaps address mapping). However, this will slow down the speed of convergence, resulting in potentially suboptimal network operation, and increase the amount of state the overlying logical layer needs to carry to properly operate.
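The remote LFA mentioned in the first bullet rests on the basic loop-free alternate condition from RFC 5286: a neighbor N of source S is a loop-free alternate toward destination D when dist(N, D) < dist(N, S) + dist(S, D), so traffic handed to N cannot loop back through S. A sketch (the topology and function names are illustrative):

```python
import heapq

def dijkstra(graph, src):
    """Shortest-path distances from src over a weighted adjacency dict."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

def is_lfa(graph, source, neighbor, dest):
    """Basic loop-free alternate condition (RFC 5286, inequality 1)."""
    d_n = dijkstra(graph, neighbor)
    d_s = dijkstra(graph, source)
    return d_n[dest] < d_n[source] + d_s[dest]

# Square topology: A-B-D and A-C-D, all links cost 1. If A's primary
# path to D runs through B, then C is a valid alternate, because
# dist(C, D) = 1 < dist(C, A) + dist(A, D) = 1 + 2.
topo = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 1},
    "C": {"A": 1, "D": 1},
    "D": {"B": 1, "C": 1},
}
```

The extra state the bullet describes is exactly these additional per-neighbor distance computations, carried and maintained before any failure occurs.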
Even within the state/speed/surface model, then, there is a three-sided problem that needs to be managed and considered when designing a network.
Network engineers are in a complex situation—it’s impossible to resolve hard problems without dealing with complex solutions, and yet complexity, itself, is an unsolvable problem. Is managing complexity simply “tilting at windmills” (or perhaps “tilting at perfect security”)? No. Rather than abandon the field, or simply ignore complexity, network engineers need to learn to manage complexity in every day design work. This final section of this final chapter provides some thoughts and tips toward this end.
You can’t run, and you can’t hide; there’s no point in even trying. You’re going to encounter complexity; ignoring it doesn’t make the problem go away, it just allows the problem to fester under some rug in some corner of your network. The complexity problems you create today will come back to haunt you in just a few years. To quote someone who’s spent years looking at complexity:
Trying to make a network proof against predictable problems tends to make it fragile in dealing with unpredictable problems (through an ossification effect as you mentioned). Giving the same network the strongest possible ability to defend itself against unpredictable problems, it necessarily follows, means that it MUST NOT be too terribly robust against predictable problems (not being too robust against predictable problems is necessary to avoid the ossification issue, but not necessarily sufficient to provide for a robust ability to handle unpredictable network problems).
—Tony Przygienda
That call at 2AM might not be pleasant, but solving the problem the wrong way might cause a much worse call at 2AM sometime down the road. Hardening the network against all failures eventually means it will fail spectacularly when a failure you didn’t predict is hit—there’s just no way around this reality.
When dealing with engineering problems, then, a little humility around what can, and cannot, be solved is in order. Don’t ignore complexity, but don’t think you can solve it, either. Instead, remember to treat every situation as a set of tradeoffs—and if you don’t see the tradeoffs, you’re not looking hard enough.
Many of the models used by network engineers are really just ways to contain complexity. For instance, there is nothing that actually constrains network designs into a hierarchical pattern (in fact, there are a number of other patterns out there to draw from in the area of design). So why do engineers use the hierarchical model so consistently? Because it provides a container around the complexity involved in designing a large-scale system with so many moving parts. Like the engineer who only wears black and gray, or the person who wears a “self-imposed uniform,” premaking some decisions aids in making other decisions more quickly.
So the first action you should take when dealing with complexity is to find a model that can contain the complexity. Whether it’s a layering model, a traffic flow model, a modularization model, or some other model, finding a model to contain the complexity will help you to abstract out specific components, and start to understand the system as a whole.
A word of caution, however—don’t stop at the first model you find. Instead, seek out and use as many models as you can to describe any specific system. Each model you use is actually an abstraction of the problem—and abstractions are leaky by their very nature. Each model—each abstraction—will be leaky, but it will be leaky in its own way. While you can never completely describe any complex system using a single model, nor even a collection of models, the more models you interact with in relation to a system the more you will understand the system and its behavior as a whole.
Some possible models might be:
• A Packet Flow Model. Network engineers normally think in terms of control plane state, rather than packet flows. It’s often useful to turn this line of thinking on its head—starting with any particular device, work through what links would be used, what queues would be passed through, and what middle boxes would be encountered when sending traffic to any other device. This is particularly useful when services are virtualized throughout the network.
• A State-Focused Model. Ask how each piece of information is carried through the network, what state is necessary to carry the information, where each piece of state originates, and how it reaches the point where it is used. For instance, to switch a specific packet, a router must have some reachability information, some information about which next hop it needs to use to reach the destination, and some information on how to encapsulate the packet. It might also need information on how to queue the data while it’s in transit through the router, and possibly special handling instructions of some type. Where does all this information come from? What process provides the information within the router, where does that process get this information, where does this information originate, how often does it change, etc.?
• An API-Focused Model. Lay out the various components of the system, and then describe the interaction surface between each of these pairs of components. This will give you a good idea of the depth and breadth of the interaction surface between each pair of components. After you’re done with the pairs, think through any three-way interaction surfaces, etc.
• A Policy-Focused Model. The hierarchical network design model is, in reality, a policy-focused model (given aggregation is a form of policy applied to control plane state). Where is policy applied, and why? What impact does the application of this policy have on the traffic flowing through the network? Does it make the traffic flow more optimally, or less? What are the tradeoffs for each policy, and each point within the network where that policy can be applied?
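The questions in the state-focused bullet above can be organized as a simple inventory of forwarding state. A sketch (the field names are illustrative, not any vendor's schema):

```python
from dataclasses import dataclass

@dataclass
class ForwardingEntry:
    """The pieces of state a router needs to switch one packet,
    annotated with where each typically originates."""
    prefix: str          # reachability: learned from the control plane
    next_hop: str        # path selection: computed by the routing protocol
    out_interface: str   # resolved from the next hop via local tables
    encapsulation: str   # from interface config / neighbor discovery
    queue_policy: str    # from locally configured QoS policy

entry = ForwardingEntry(
    prefix="203.0.113.0/24",
    next_hop="198.51.100.1",
    out_interface="eth0",
    encapsulation="ethernet",
    queue_policy="best-effort",
)
```

Tracing each field back to the process that supplies it, and asking how often it changes, is the essence of the state-focused model.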
Thinking through models that answer each of these questions will help you wrap your head around the complexity you face in real systems.
If you’ve reached this section, you’ve read through a lot of very difficult to understand text that’s very focused on theory (if you’ve skipped ahead, shame on you—go back and do your reading like a real engineer!). It might, at times, seem difficult to see how to apply the theoretical side of network engineering to the real world—but the connection is definitely there. Learning the theory will enable you to see problems more quickly, ask better questions, know what to look for (and how to look), pick up new technologies more quickly, and know how to evaluate technologies.
There are, in the real world of engineering, no unicorns standing under rainbows to be found. There are only tradeoffs—tradeoffs in complexity that you now know how to navigate more successfully.
abstraction
leaky. See leaky abstractions
orchestration system, 263
aggregation
concept, 83
as leaky abstraction, 125, 126
API. See Application Programming Interface (API)
apocryphal story of the engineer and the hammer, 65–66
application layer
four-layer model, 147
seven-layer model, 145
Application Programming Interface (API), 9, 143
Application Specific Integrated Circuit (ASIC), 235
application support, policy and, 218
The Art of Network Architecture (Cisco Press), 44
ASIC. See Application Specific Integrated Circuit (ASIC)
Asynchronous Transfer Mode (ATM), 106
ATM. See Asynchronous Transfer Mode (ATM)
auditing, cloud provider, 274
automation
abstractions of network state, 71
as solution to management complexity, 69–72
availability
vs. redundancy, 90
bandwidth calendaring, 188, 193–196
bandwidth utilization, 193
BGP. See Border Gateway Protocol (BGP)
Border Gateway Protocol (BGP), 9
amount of information carried in, 32
formula for, 35
propagating a new destination, 35–36
layering, on OSPF, 6
Brewer, Eric, 21
business continuity, policy and, 218
business drivers, programmable networks, 186–188
capacity vs. growth over time, 239, 240
Capital Expenditures (CAPEX), 188
centralization
cheap solution. See quick, cheap, and high quality conundrum
choke points, 134
Chun, Byung-Gon, 53
Cisco Technical Assistance Center, 235
Clarke, Arthur C., 3
clean solution, 3
CLNP. See Connectionless Networking Protocol (CLNP)
Clos, Charles, 114
cloud computing, 267
cloud services
cloud centric deployment, 269–270
network centric model, 271
overview, 267
cohesion principle, 263
command-line interfaces (CLIs), 192–193
negative responses to, 2
numbers of moving parts, 5–9, 280
positive responses to, 2
reactions to, 2
robustness and, 281
routing protocols interaction, 6
unintended consequences, 11–13, 280
complex solution, 3
computing, 272
Connectionless Networking Protocol (CLNP), 143
continue in segment routing, 247
controller to controller failure domain, 229
control plane
failure domain, 229
policy, 219
control plane complexity, virtualization, 109
control plane state vs. stretch, 81–87. See also stretch
coupling and failure domains, 257–260
database
accessible, 21
consistent, 21
tolerating partitions, 22
Data Center (DC) edge switch, 243
data center failure domains, 228–229
data link layer, 144
data models, 204
data portability, 276
data science, 55
data scientist, 55
decentralization vs. centralization, 189–190
default free zone, 37
Denial of Service (DoS) attacks, 198
depth of interaction surfaces, 39–40
Designated Router (DR), 140
Desktop Virtualization (DV), 273
Diffusing Update Algorithm (DUAL), 160, 161
direct to server attacks, 198
distance vector protocols. See also Enhanced Interior Gateway Routing Protocol (EIGRP)
amount of information carried in, 32
ring topologies, 88
distributed systems, 21
DR. See Designated Router (DR)
DUAL. See Diffusing Update Algorithm (DUAL)
Dyche, Jill, 187
ebb and flow of centralization, 188–191
EIGRP. See Enhanced Interior Gateway Routing Protocol (EIGRP)
End Systems (ES), 138
end-to-end principle, 216. See also subsidiarity principle
Enhanced Interior Gateway Routing Protocol (EIGRP)
concept, 95
large-scale environments, 161–162
large-scale manual deployments, 63–64
stuck in active process, 162
stuck in active timer, 161
error detection, parity bit and, 17–18
eXtensible Markup Language (XML). See XML
failure domains
controller to controller, 229
control plane, 229
and information hiding, 126–128
programmable networks, 226–230
failure theory, 163
fast convergence, 283
feedback loops, 164
negative feedback loop, 164–166
positive feedback loop, 166–174
FIB. See Forwarding Information Base (FIB)
fixed length encoding, OSPF, 139
fixed packets, OSPF, 139
focused layered functionality, hierarchical models, 133
focused policy points, hierarchical models, 134
Forwarding Information Base (FIB), 200. See also interfaces, programmable networks
forwarding plane complexity, 109
application layer, 147
Internet layer, 147
link layer, 146
transport layer, 147
functional separation, virtualization, 108–109
Garcia, J. J., 152
Goldberg, Rube. See Rube Goldberg Machine
Graphical User Interfaces, 192
growth vs. capacity over time, 239, 240
hack, 4
hardware abstraction layer, 116
hiding information. See information hiding
basic design, 133
focused layered functionality, 133
focused policy points, 134
modularity and, 121
speed, 43
state, 43
surfaces, 43
HyperText Markup Language (HTML). See HTML
I2RS。请参阅 路由系统接口 (I2RS)
I2RS. See Interface to the Routing System (I2RS)
基础设施即服务。查看 基础设施即服务 (IaaS)
IaaS. See infrastructure as a service (IaaS)
IGRP。请参阅 内部网关路由协议 (IGRP)
IGRP. See Interior Gateway Routing Protocol (IGRP)
信息模型,204
information models, 204
基础设施即服务 (IaaS),272
infrastructure as a service (IaaS), 272
内部攻击,197
insider attack, 197
情报/分析,272
intelligence/analytics, 272
intelligent timers, convergence with, 96–99
模块化,以及,122
modularization and, 122
overlapping interactions, 40–41
interfaces, programmable networks, 200–201
FIB,200
FIB, 200
包含的信息, 200
information contained by, 200
北行,200
northbound, 200
切换路径,201
switching path, 201
Interface to the Routing System (I2RS), 210–212
异步、过滤事件,212
asynchronous, filtered events, 212
分布式路由协议,212
distributed routing protocols, 212
短暂状态,212
ephemeral state, 212
multiple simultaneous asynchronous operations, 211–212
Interior Gateway Routing Protocol (IGRP), 63–64
中间系统(IS),138
Intermediate Systems (IS), 138
中间系统到中间系统 (IS-IS),9
Intermediate System-to-Intermediate System (IS-IS), 9
信息编码,140
information encoding, 140
互联网层,147
Internet layer, 147
互联网研究工作组,51
Internet Research Task Force, 51
库存、接口和,200
inventory, interfaces and, 200
乔尔谈软件,117
Joel on Software, 117
埃迪·科勒,53 岁
Kohler, Eddie, 53
标签分发协议(LDP),158
Label Distribution Protocol (LDP), 158
标签交换路径 (LSP),75
Label Switched Path (LSP), 75
落后。请参阅 链路聚合组 (LAG)
LAG. See Link Aggregation Groups (LAG)
law of leaky abstractions, 117. See also leaky abstractions
protocol complexity vs., 141–149
seven-layer model, 43–44, 143–146
自民党。请参阅 标签分发协议 (LDP)
LDP. See Label Distribution Protocol (LDP)
leaky abstractions, 76, 117–118
LFA。请参阅 无循环替代 (LFA)
LFA. See Loop Free Alternates (LFA)
李托尼,185
Li, Tony, 185
链路聚合组 (LAG),93
Link Aggregation Groups (LAG), 93
链路层,四层模型,146
link layer, four-layer model, 146
Link State Advertisement (LSA), 95, 140
链路状态数据包 (LSP),139
Link State Packet (LSP), 139
链路状态协议。另请参见 中间系统到中间系统 (IS-IS);开放最短路径优先 (OSPF)
link state protocols. See also Intermediate System-to-Intermediate System (IS-IS); Open Shortest Path First (OSPF)
携带的信息量,33
amount of information carried in, 33
control plane traffic flooding, 92–93
ring topologies, 89
Linux kernel code, 261
local information, local control, 216
Loop Free Alternates (LFA), 152–154
LSA. See Link State Advertisement (LSA)
automation as solution to, 69–72
modularity as solution to, 72–74
protocol complexity vs., 74–76
Maximum Transmission Unit (MTU), 139
Mean Time Between Mistakes (MTBM), 63, 187
modeling design complexity, 51–53
Network Complexity Index, 49–51
microloops
LFA. See Loop Free Alternates (LFA)
minimum route advertisement interval. See MRAI (minimum route advertisement interval)
modeling design complexity, 51–53. See also measuring complexity
interchangeable modules, 120–121
as solution to management complexity, 72–74
uniformity. See uniformity
monoculture failures, 114
MPLS, 106, 118, 119–120, 157
MRAI (minimum route advertisement interval), 35, 36
MTBM. See Mean Time Between Mistakes (MTBM)
Multiprotocol Label Switching (MPLS). See MPLS
mutual redistribution, 171–172
negative feedback loop, 164–166
NetComplex, 53–55. See also measuring complexity
Network Address Translation (NAT), 234, 247
network centric model, 271
Network Complexity Index, 49–51. See also measuring complexity
Network Complexity Research Group, 51
network convergence. See convergence
Network Function Virtualization (NFV), 234–241
Network Time Protocol (NTP), 8
Network Translation, 234
NFV. See Network Function Virtualization (NFV)
northbound interfaces, 200
NTP. See Network Time Protocol (NTP)
NVGRE, 244
obscure code, 3
Occam’s Razor, 10
Open Shortest Path First (OSPF), 9, 95
aggregated reachability, 139
BGP layering on, 6
event-driven detection and, 45–46
fixed length encoding, 139
fixed packets, 139
goals, 139
information encoding, 140
Open Systems Interconnect (OSI) protocol stack, 138
Operational Expenditures (OPEX), 62
optimal traffic handling vs. policy dispersion, 66–69
state and, 254
timer-driven detection, 45
orchestration effect, 262–263. See also virtualization
organized complexity, 55–57. See also measuring complexity
overlapping interactions, 40–41. See also interaction surfaces
PaaS. See platform as a service (PaaS)
packet classification, 256
packet loops, 168–170. See also positive feedback loop
parity bit
packet processing, 18
Path Computation Client (PCC), 207
Path Computation Element Protocol (PCEP), 207–210
OpenFlow vs., 209
path vector convergence, 26–28. See also Border Gateway Protocol (BGP)
PCC. See Path Computation Client (PCC)
PCEP. See Path Computation Element Protocol (PCEP)
physical layer, 144
PIN. See Places in the Network (PIN)
PIX-PL, 235
Places in the Network (PIN), 131–132
platform as a service (PaaS), 273
policy, 82
business requirements and, 217–218
control plane, 219
Policy Based Routing, 82
policy complexity, programmable network, 223–224
policy consistency, programmable network, 222–223
policy dispersion
vs. optimal traffic handling, 66–69
programmable network complexity, 220–222
positive feedback loop, 166–168
mutual redistribution, 171
predictability, 262
presentation layer, 145
human interaction with system, 61–63
policy dispersion vs. optimal traffic handling, 66–69
troubleshooting network failure, 65–66
programmable networks
ebb and flow of centralization, 188–191
software-defined perimeter, 196–199
protocol complexity
design complexity vs., 149–162
layering vs., 141–149. See also layering
management complexity vs., 74–76
quality solution. See quick, cheap, and high quality conundrum
quick, cheap, and high quality conundrum, 20–21, 283
Ratnasamy, Sylvia, 53
redistribution between two routing protocols, 171–172
redundancy vs. resilience, 90–93
Rekhter, Yakov, 185
REpresentational State Transfer (REST) interface, 207
resilience, redundancy vs., 90–93
Resource Reservation Protocol (RSVP), 17
RESTCONF, 207
return on investment, 218
RIB. See Routing Information Base (RIB)
distance vector protocol, 88
fast reroute, 155
link state protocol, 89
using, reasons for, 89
RIP. See Routing Information Protocol (RIP)
robustness and complexity, 281
robust yet fragile, 19
root cause analysis, troubleshooting and, 65
Routing Information Base (RIB), 200, 201
Routing Information Protocol (RIP), 95
RSVP. See Resource Reservation Protocol (RSVP)
SaaS. See software as a service (SaaS)
Saltzer, J. H., 216
scale out vs. scale up, 238–239
secure information, policy and, 218
security
as castle, 196
cross contamination, 275
encryption, 276
physical media management, 275–276
virtual topologies, 107
continue in, 247
example, 246
MPLS LSPs, 248
packet, 247
traffic, 247
self-documenting code, 4
state and, 253
Service Function Chaining (SFC), 243–245
architecture of, 244
description, 243
example, 244
specifications, 244
session layer, 145
seven-layer model, 43–44, 143–146
application layer, 145
data link layer, 144
Ethernet, 145
IP, 145
physical layer, 144
presentation layer, 145
session layer, 145
TCP, 146
transport layer, 145
SFC. See Service Function Chaining (SFC)
shared fate
Shared Risk Link Groups (SRLG), 111, 177, 178
Shortest Path Tree (SPT), 95
single responsibility principle, 263
software as a service (SaaS), 273
software-defined networks (SDN), 185
software-defined perimeter, 196–199
software environment, 272
Source Packet Routing in the Network (SPRING), 245, 246
spaghetti code, 261
spaghetti transport system, 118–120
Spanning Tree failure, 174–175
network never converging, 35–37
Spolsky, Joel, 117
SPT. See Shortest Path Tree (SPT)
and optimization, 254
and service chaining, 253
vs. stretch, 87
stretch vs. See stretch
static state, 35
storage, 272
storage as a service, 272
concept, 81
control plane state vs., 81–87
increasing, 82
measuring, 82
state vs., 87
stuck in active process, of EIGRP, 162
stuck in active timer, EIGRP, 161
subsidiarity principle, 216–217
Super Bowl (1996), 11
surface, 38–41. See also interaction surfaces
policy interaction, 255
TCP synchronization, shared fate problem as, 179–180
technical debt, 13
telemetry, 200
temporary fix, 65
temporary workaround, 4
three-dimensional Mandelbulb, 5–6
thunk, 116
timer-driven detection, 45
timers, convergence with
topology
aggregation of information, 124–125
interfaces and, 200
ring topologies. See ring topologies
vs. speed of convergence, 88–94
ToR switch, 245
“Towards Robust Distributed Systems” (Brewer), 21
Transmission Control Protocol (TCP), 7, 8
internal state, 8
NTP and, 8
transport layer
four-layer model, 147
seven-layer model, 145
troubleshooting, virtualization, 260–262
troubleshooting network failure, 65–66
problem identification, 65
problem remediation, 65
root cause analysis, 65
TTL, 76
tunneled fast reroute mechanisms, 103–104
tunneling protocols, SFC, 244–245
Type-Length-Values (TLV), 4, 139, 140–141
encodings, 14
header, 15
on-the-wire bandwidth, 17
vs. optimally structured packet formats, 15–16
UDP. See User Datagram Protocol (UDP)
Unified Modeling Language (UML), 134–136
control and management, 116
vendor, 114
unintended consequences, 11–13, 280
User Datagram Protocol (UDP), 135
vendor, uniformity, 114
vendor lock-in, 114
Virtual Extensible Local Area Network (VXLAN). See VXLAN
control plane complexity, 109
design complexity vs., 106–111
design considerations, 256–262
forwarding plane complexity, 109
functional separation, 108–109
service chaining. See service chaining
SRLG, 111
state and optimization, 254
state and service chaining, 253
surface and policy interaction, 255
surface and policy proxies, 255–256
virtual switch (VSwitch), 245
VXLAN, 118, 119, 244
Weaver, Warren, 56
web application, UML for, 134–136
Weighted Random Early Detection (WRED), 180
wide area failure domains, 227–228
WRED. See Weighted Random Early Detection (WRED)
XML, 205
specification, 204